Abstract



This paper gives identification and estimation results for marginal effects in nonlinear panel 
models. We find that linear fixed effects estimators are not consistent, due in part to marginal 
effects not being identified. We derive bounds for marginal effects and show that they can tighten 
rapidly as the number of time series observations grows. We also show in numerical calculations 
that the bounds may be very tight for small numbers of observations, suggesting they may be 
useful in practice. We propose two novel inference methods for parameters defined as solutions 
to linear and nonlinear programs such as marginal effects in multinomial choice models. We 
show that these methods produce uniformly valid confidence regions in large samples. We give 
an empirical illustration. 



1 Introduction 



Marginal effects are commonly used in practice to quantify the effect of variables on an outcome
of interest. They are known as average treatment effects, average partial effects, and average 
structural functions in different contexts (e.g., see Wooldridge, 2002, Blundell and Powell, 2003). 
In panel data marginal effects average over unobserved individual heterogeneity. Chamberlain 
(1984) gave important results on identification of marginal effects in nonlinear panel data using 
control variables. Our paper gives identification and estimation results for marginal effects in
panel data under time stationarity and discrete regressors. 

It is sometimes thought that marginal effects can be estimated using linear fixed effects, 
as shown by Hahn (2001) in an example and Wooldridge (2005) under strong independence 
conditions. It turns out that the situation is more complicated. The marginal effect may not 
be identified. Furthermore, with a binary regressor, the linear fixed effects estimator uses the 
wrong weighting in estimation when the number of time periods T exceeds three. We show 
that correct weighting can be obtained by averaging individual regression coefficients, extending 
a result of Chamberlain (1982). We also derive nonparametric bounds for the marginal effect 
when it is not identified and when regressors are either exogenous or predetermined conditional 
on individual effects. 

The nonparametric bounds are quite simple to compute and to use for inference but can 
be quite wide when T is small. We also consider bounds in semiparametric multinomial choice 
models where the form of the conditional probability given regressors and individual effects is 
specified. We find that the semiparametric bounds can be quite tight in binary choice models 
with additive heterogeneity. 

We also give theorems showing that the bounds can tighten quickly as T grows. We find 
that the nonparametric bounds tighten exponentially fast when conditional probabilities of certain
regressor values are bounded away from zero. We also find that in a semiparametric logit model 
the bounds tighten nearly that fast without any restriction on the distribution of regressors. 

These results suggest how the bounds can be used in practice. For large T the nonparametric 
bounds may provide useful information. For small T, bounds in semiparametric models may 
be quite tight. Also, the tightness of semiparametric bounds for small T makes it feasible to 
compute them for different small time intervals and combine results to improve efficiency. To 
illustrate their usefulness we provide an empirical illustration based on Chamberlain's (1984) 
labor force participation example. 

We also develop estimation and inference methods for semiparametric multinomial choice 
models. The inferential problem is rather challenging. Indeed, the programs that characterize 
the population bounds on model parameters and marginal effects are very difficult to use for 
inference, since the data-dependent constraints are often infeasible in finite samples or under
misspecification, which produces empty set estimates and confidence regions. We overcome these 
difficulties by projecting these data-dependent constraints onto the model space, thus producing 
an always feasible data-dependent constraint set. We then propose linear and nonlinear pro- 
gramming methods that use these new modified constraints. Our inference procedures have the 
appealing justification of targeting the true model under correct specification and targeting the 
best approximating model under incorrect specification. We develop two novel inferential pro- 
cedures, one called modified projection and another perturbed bootstrap, that produce uniformly 
valid inference in large samples. These methods may be of substantial independent interest. 

This paper builds on Honore and Tamer (2006) and Chernozhukov, Hahn, and Newey (2004). 
These papers derived bounds for slope coefficients in autoregressive and static models, respec- 
tively. Here we instead focus on marginal effects and give results on the rate of convergence 
of bounds as T grows. Moreover, the identification results in Honore and Tamer (2006) and 
Chernozhukov, Hahn, and Newey (2004) characterize the bounds via linear and nonlinear programs,
and thus, for the reasons we stated above, they cannot be immediately used for practical
estimation and inference. We propose new methods for estimation and inference, which are prac- 
tical and which can be of interest in other problems, and we illustrate them with an empirical 
application. 

Browning and Carro (2007) give results on marginal effects in autoregressive panel models. 
They find that more than additive heterogeneity is needed to describe some interesting
applications. They also find that marginal effects are not generally identified in dynamic models.
Chamberlain (1982) gives conditions for consistent estimation of marginal effects in linear corre- 
lated random coefficient models. Graham and Powell (2008) extend the analysis of Chamberlain 
(1982) by relaxing some of the regularity conditions in models with continuous regressors. 

In semiparametric binary choice models Hahn and Newey (2004) gave theoretical and simu- 
lation results showing that fixed effects estimators of marginal effects in nonlinear models may 
have little bias, as suggested by Wooldridge (2002). Fernandez-Val (2008) found that averaging
fixed effects estimates of individual marginal effects has bias that shrinks faster as T grows than 
does the bias of slope coefficients. We show that, with small T, nonlinear fixed effects consis- 
tently estimates an identified component of the marginal effects. We also give numerical results 
showing that the bias of fixed effects estimators of the marginal effect is very small in a range 
of examples. 

The bounds approach we take is different from the bias correction methods of Hahn and 
Kuersteiner (2002), Alvarez and Arellano (2003), Woutersen (2002), Hahn and Newey (2004), 
Hahn and Kuersteiner (2007), and Fernandez-Val (2008). The bias corrections are based on large
T approximations. The bounds approach takes explicit account of possible nonidentification for
fixed T. Inference accuracy of bias corrections will depend on T being the right size relative to
the number of cross-section observations n, while inference for bounds does not. 

In Section 2 we give a general nonparametric conditional mean model with correlated unob- 
served individual effects and strictly exogenous regressors, and analyze the properties of linear 
estimators. Section 3 gives bounds for marginal effects in these models and results on the rate 
of convergence of these bounds as T grows. Section 4 extends the analysis to models with pre- 
determined regressors. Section 5 gives similar results, with tighter bounds, in a binary choice 
model with a location shift individual effect. Section 6 gives results and numerical examples 
on calculation of population bounds. Section 7 discusses estimation and Section 8 inference. 
Section 9 gives an empirical example. 

2 A Conditional Mean Model and Linear Estimators 

The data consist of $n$ observations of time series $Y_i = (Y_{i1}, \ldots, Y_{iT})'$ and $X_i = [X_{i1}, \ldots, X_{iT}]'$, for a
dependent variable $Y_{it}$ and a vector of regressors $X_{it}$. We will assume throughout that $(Y_i, X_i)$,
$(i = 1, \ldots, n)$, are independent and identically distributed observations. A case we consider in
some depth is binary choice panel data where $Y_{it} \in \{0, 1\}$. For simplicity we also give some
results for binary $X_{it}$, where $X_{it} \in \{0, 1\}$.

A general model we consider is a nonseparable conditional mean model as in Wooldridge
(2005). Here there is an unobserved individual effect $\alpha_i$ and a function $m(x, \alpha)$ such that

$$E[Y_{it} \mid X_i, \alpha_i] = m(X_{it}, \alpha_i), \quad (t = 1, \ldots, T). \tag{1}$$

The individual effect $\alpha_i$ may be a vector of any dimension. For example, $\alpha_i$ could include
individual slope coefficients in a binary choice model, where $Y_{it} \in \{0, 1\}$, $F(\cdot)$ is a CDF, and

$$\Pr(Y_{it} = 1 \mid X_i, \alpha_i) = E[Y_{it} \mid X_i, \alpha_i] = F(X_{it}'\alpha_{i2} + \alpha_{i1}).$$

Such models have been considered by Browning and Carro (2007) in a dynamic setting. More
familiar models with scalar $\alpha_i$ are also included. For example, the binary choice model with an
individual location effect has

$$\Pr(Y_{it} = 1 \mid X_i, \alpha_i) = E[Y_{it} \mid X_i, \alpha_i] = F(X_{it}'\beta^* + \alpha_i).$$

This model has been studied by Chamberlain (1980, 1984, 1992), Hahn and Newey (2004), and
others. The familiar linear model $E[Y_{it} \mid X_i, \alpha_i] = X_{it}'\beta^* + \alpha_i$ is also included as a special case
of equation (1).

For binary $X_{it} \in \{0, 1\}$ the model of equation (1) reduces to the correlated random coefficients
model of Chamberlain (1982). For other $X_{it}$ with finite support that does not vary with $t$ it is
a multiple regression version of that model.
The two critical assumptions made in equation (1) are that $X_i$ is strictly exogenous conditional
on $\alpha$ and that $m(x, \alpha)$ does not vary with time. We consider identification without
the strict exogeneity assumption below. Without time stationarity, identification becomes more
difficult.

Our primary object of interest is the marginal effect given by

$$\mu_0 = \frac{\int [m(\tilde{x}, \alpha) - m(\bar{x}, \alpha)]\, Q^*(d\alpha)}{D},$$

where $\tilde{x}$ and $\bar{x}$ are two possible values for the $X_{it}$ vector, $Q^*$ denotes the marginal distribution
of $\alpha$, and $D$ is the distance, or number of units, corresponding to $\tilde{x} - \bar{x}$. This object gives the
average, over the marginal distribution, of the per unit effect of changing $x$ from $\bar{x}$ to $\tilde{x}$. It is the
average treatment effect in the treatment effects literature. For example, suppose $\tilde{x} = (\tilde{x}_1, x_2)'$
where $\tilde{x}_1$ is a scalar, and $\bar{x} = (\bar{x}_1, x_2)'$. Then $D = \tilde{x}_1 - \bar{x}_1$ would be an appropriate distance
measure and

$$\mu_0 = \frac{\int [m(\tilde{x}_1, x_2, \alpha) - m(\bar{x}_1, x_2, \alpha)]\, Q^*(d\alpha)}{\tilde{x}_1 - \bar{x}_1}$$

would be the per unit effect of changing the first component of $X_{it}$. Here one could also consider
averages of the marginal effects over different values of $x_2$.

For example, consider an individual location effect for binary $Y_{it}$ where $m(x, \alpha) = F(x'\beta^* +
\alpha)$. Here the marginal effect will be

$$\mu_0 = D^{-1} \int [F(\tilde{x}'\beta^* + \alpha) - F(\bar{x}'\beta^* + \alpha)]\, Q^*(d\alpha).$$

The restrictions this binary choice model places on the conditional distribution of $Y_{it}$ given $X_i$
and $\alpha_i$ will be useful for bounding marginal effects, as further discussed below.
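
As a small illustration of this definition, the following sketch evaluates the marginal effect in the location-effect model when the distribution of the individual effect is discrete. It is only a sketch under stated assumptions: the logistic choice of F, the scalar regressor, and the support points and probabilities used for Q* are all hypothetical and are not taken from the paper.

```python
import math

def logistic_cdf(z):
    """Logistic CDF, used here as an assumed choice of F."""
    return 1.0 / (1.0 + math.exp(-z))

def marginal_effect(x_tilde, x_bar, beta, alphas, probs, D=1.0, F=logistic_cdf):
    """Per-unit effect of moving a scalar regressor from x_bar to x_tilde,
    averaged over a discrete distribution for the individual effect.

    alphas, probs: support points and probabilities of a hypothetical Q*.
    """
    total = 0.0
    for a, q in zip(alphas, probs):
        total += q * (F(x_tilde * beta + a) - F(x_bar * beta + a))
    return total / D

# Hypothetical three-point distribution for the individual effect.
mu0 = marginal_effect(1.0, 0.0, beta=1.0,
                      alphas=[-1.0, 0.0, 1.0], probs=[0.25, 0.5, 0.25])
print(mu0)
```

With a degenerate distribution at zero the function returns F(beta) - F(0), which is the familiar single-index difference.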

In this paper we focus on the discrete case where the support of $X_i$ is a finite set. Thus, the
events $X_{it} = \tilde{x}$ and $X_{it} = \bar{x}$ have positive probability and no smoothing is required. It would
also be interesting to consider continuous $X_{it}$.

Linear fixed effect estimators are used in applied research to estimate marginal effects. For 
example, the linear probability model with fixed effects has been applied when $Y_{it}$ is binary.
Unfortunately, this estimator is not generally consistent for the marginal effect. There are
two reasons for this. The first is that the marginal effect is generally not identified, as shown by
Chamberlain (1982) for binary $X_{it}$. Second, the fixed effects estimator uses incorrect weighting.

To explain, we compare the limit of the usual linear fixed effects estimator with the marginal
effect $\mu_0$. Suppose that $X_i$ has finite support $\{X^1, \ldots, X^K\}$ and let $Q_k^*(\alpha)$ denote the CDF of
the distribution of $\alpha$ conditional on $X_i = X^k$. Define

$$\mu_k = \int [m(\tilde{x}, \alpha) - m(\bar{x}, \alpha)]\, Q_k^*(d\alpha)/D, \qquad \mathcal{P}_k = \Pr(X_i = X^k).$$

This is the marginal effect conditional on the entire time series $X_i = [X_{i1}, \ldots, X_{iT}]'$ being
equal to $X^k$. By iterated expectations,

$$\mu_0 = \sum_{k=1}^{K} \mathcal{P}_k \mu_k. \tag{2}$$

We will compare this formula with the limit of linear fixed effects estimators. 

An implication of the conditional mean model that is crucial for identification is

$$E[Y_{it} \mid X_i = X^k] = \int m(X_t^k, \alpha)\, Q_k^*(d\alpha), \tag{3}$$

where $X^k = [X_1^k, \ldots, X_T^k]'$. This equation allows us to identify some of the $\mu_k$ from differences
across time periods of identified conditional expectations.

To simplify the analysis of the linear fixed effect estimator we focus on binary $X_{it} \in \{0, 1\}$.
Consider $\hat{\beta}_w$ from least squares on

$$Y_{it} = X_{it}\beta + \gamma_i + v_{it}, \quad (t = 1, \ldots, T;\ i = 1, \ldots, n),$$

where each $\gamma_i$ is estimated. This is the usual within estimator, where for $\bar{X}_i = \sum_{t=1}^{T} X_{it}/T$,

$$\hat{\beta}_w = \frac{\sum_{i,t} (X_{it} - \bar{X}_i) Y_{it}}{\sum_{i,t} (X_{it} - \bar{X}_i)^2}.$$

Here the estimator of the marginal effect is just $\hat{\beta}_w$. To describe its limit, let $r^k = \#\{t : X_t^k =
1\}/T$ and $\sigma_k^2 = r^k(1 - r^k)$ be the variance of a binomial with probability $r^k$.

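Theorem 1: If equation (1) is satisfied, $(X_i, Y_i)$ has finite second moments, and $\sum_{k=1}^{K} \mathcal{P}_k \sigma_k^2 >
0$, then

$$\hat{\beta}_w \xrightarrow{p} \frac{\sum_{k=1}^{K} \mathcal{P}_k \sigma_k^2 \mu_k}{\sum_{k=1}^{K} \mathcal{P}_k \sigma_k^2}. \tag{4}$$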

This result is similar to Angrist (1998) who found that, in a treatment effects model in
cross section data, the partially linear slope estimator is a variance weighted average effect.
Comparing equations (2) and (4) we see that the linear fixed effects estimator converges to a
weighted average of $\mu_k$, weighted by $\sigma_k^2$, rather than the simple average in equation (2). The
weights are never completely equal, so that the linear fixed effects estimator is not consistent
for the marginal effect unless how $\mu_k$ varies with $k$ is restricted. Imposing restrictions on how
$\mu_k$ varies with $k$ amounts to restricting the conditional distribution of $\alpha_i$ given $X_i$, which we are
not doing in this paper.

One reason for inconsistency of $\hat{\beta}_w$ is that certain $\mu_k$ receive zero weight. For notational
purposes let $X^1 = (0, \ldots, 0)'$ and $X^K = (1, \ldots, 1)'$ (where we implicitly assume that these are
included in the support of $X_i$). Note that $\sigma_1^2 = \sigma_K^2 = 0$ so that $\mu_1$ and $\mu_K$ are not included in
the weighted average. The explanation for their absence is that $\mu_1$ and $\mu_K$ are not identified.
These are marginal effects conditional on $X_i$ equal to a vector of constants, where there are no
changes over time to help identify the effect from equation (3). Nonidentification of these effects
was pointed out by Chamberlain (1982).

Another reason for inconsistency of $\hat{\beta}_w$ is that for $T \geq 4$ the weights on $\mu_k$ will be different
than the corresponding weights for $\mu_0$. This is because $\sigma_k^2$ varies for $k \notin \{1, K\}$ except when
$T = 2$ or $T = 3$.
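
The claim about the weights can be checked by direct enumeration. The sketch below lists the distinct values of the weight r(1 - r) over all non-constant binary sequences of length T, using exact rational arithmetic; the function name is ours and is only illustrative.

```python
from fractions import Fraction
from itertools import product

def distinct_weights(T):
    """Distinct values of r * (1 - r) over non-constant binary
    sequences of length T, where r is the fraction of ones."""
    weights = set()
    for seq in product((0, 1), repeat=T):
        r = Fraction(sum(seq), T)
        if 0 < r < 1:  # skip the all-zero and all-one sequences
            weights.add(r * (1 - r))
    return sorted(weights)

for T in (2, 3, 4):
    print(T, distinct_weights(T))
```

For T = 2 and T = 3 there is a single weight value, so the non-constant support points are weighted in proportion to their probabilities; from T = 4 onward the weights differ across support points.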

This result is different from Hahn (2001), who found that $\hat{\beta}_w$ consistently estimates the
marginal effect. Hahn (2001) restricted the support of $X_i$ to exclude both $(0, \ldots, 0)'$ and $(1, \ldots, 1)'$
and only considered a case with $T = 2$. Thus, neither feature that causes inconsistency of $\hat{\beta}_w$
was present in that example. As noted by Hahn (2001), the conditions that lead to consistency
of the linear fixed effects estimator in his example are quite special.

Theorem 1 is also different from Wooldridge (2005). There it is shown that if $\delta_i = m(1, \alpha_i) -
m(0, \alpha_i)$ is mean independent of $X_{it} - \bar{X}_i$ for each $t$ then linear fixed effects is consistent. The
problem is that this independence assumption is very strong when $X_{it}$ is discrete. Note that
for $T = 2$, $X_{i2} - \bar{X}_i$ takes on the values $0$ when $X_i = (1, 1)$ or $(0, 0)$, $-1/2$ when $X_i = (1, 0)$,
and $1/2$ when $X_i = (0, 1)$. Thus mean independence of $\delta_i$ and $X_{i2} - \bar{X}_i$ actually implies that
$\mu_2 = \mu_3$ and that these are equal to the marginal effect conditional on $X_i \in \{X^1, X^4\}$. This
is quite close to independence of $\delta_i$ and $X_i$, which is not very interesting if we want to allow
correlation between the regressors and the individual effect.

The lack of identification of $\mu_1$ and $\mu_K$ means the marginal effect is actually not identified.
Therefore, no consistent estimator of it exists. Nevertheless, when $m(x, \alpha)$ is bounded there are
informative bounds for $\mu_0$, as we show below.

The second reason for inconsistency of $\hat{\beta}_w$ can be corrected by modifying the estimator.
In the binary $X_{it}$ case Chamberlain (1982) gave a consistent estimator for the identified effect
$\mu_I = \sum_{k=2}^{K-1} \mathcal{P}_k \mu_k / \sum_{k=2}^{K-1} \mathcal{P}_k$. The estimator is obtained from averaging across individuals the
least squares estimates of $\beta_i$ in

$$Y_{it} = X_{it}\beta_i + \gamma_i + v_{it}, \quad (t = 1, \ldots, T;\ i = 1, \ldots, n).$$

For $s_i^2 = \sum_{t=1}^{T} (X_{it} - \bar{X}_i)^2$ and $n^* = \sum_{i=1}^{n} 1(s_i^2 > 0)$, this estimator takes the form

$$\hat{\beta} = \frac{1}{n^*} \sum_{i=1}^{n} 1(s_i^2 > 0)\, \frac{\sum_{t=1}^{T} (X_{it} - \bar{X}_i) Y_{it}}{s_i^2}.$$

This is equivalent to running least squares in the model

$$Y_{it} = \beta_k X_{it} + \gamma_k + v_{it} \tag{5}$$

for individuals with $X_i = X^k$, and averaging over $k$ weighted by the sample frequencies of $X^k$.

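The estimator $\hat{\beta}$ of the identified marginal effect $\mu_I$ can easily be extended to any discrete
$X_{it}$. To describe the extension, let $\tilde{d}_{it} = 1(X_{it} = \tilde{x})$, $\bar{d}_{it} = 1(X_{it} = \bar{x})$, $\tilde{r}_i = \sum_{t=1}^{T} \tilde{d}_{it}/T$, $\bar{r}_i =
\sum_{t=1}^{T} \bar{d}_{it}/T$, and $n^* = \sum_{i=1}^{n} 1(\tilde{r}_i > 0) 1(\bar{r}_i > 0)$. The estimator is given by

$$\hat{\beta} = \frac{1}{n^*} \sum_{i=1}^{n} 1(\tilde{r}_i > 0) 1(\bar{r}_i > 0) \left[ \frac{\sum_{t=1}^{T} \tilde{d}_{it} Y_{it}}{T \tilde{r}_i} - \frac{\sum_{t=1}^{T} \bar{d}_{it} Y_{it}}{T \bar{r}_i} \right] \Big/ D.$$

This estimator extends Chamberlain's (1982) estimator to the case where $X_{it}$ is not binary.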
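
The averaging estimator just described can be sketched in a few lines. This is an illustrative implementation, not the authors' code: the data layout (one list of regressor values and one list of outcomes per individual) and the function name are assumptions, and regressor values are compared with simple equality.

```python
def beta_hat(X, Y, x_tilde, x_bar, D=1.0):
    """Average, over individuals who experience both x_tilde and x_bar,
    of the within-individual difference in mean outcomes, per unit D.

    X, Y: lists of per-individual sequences of equal length T.
    """
    total, n_star = 0.0, 0
    for x_i, y_i in zip(X, Y):
        y_at_tilde = [y for x, y in zip(x_i, y_i) if x == x_tilde]
        y_at_bar = [y for x, y in zip(x_i, y_i) if x == x_bar]
        if y_at_tilde and y_at_bar:  # both values occur for this individual
            n_star += 1
            total += (sum(y_at_tilde) / len(y_at_tilde)
                      - sum(y_at_bar) / len(y_at_bar))
    if n_star == 0:
        raise ValueError("no individual has both regressor values")
    return total / (n_star * D)

# Tiny hypothetical panel with T = 2 and four individuals.
X = [[0, 1], [1, 1], [1, 0], [0, 0]]
Y = [[0, 1], [1, 1], [1, 1], [0, 0]]
print(beta_hat(X, Y, x_tilde=1, x_bar=0))
```

Individuals whose regressor never changes contribute nothing here, which mirrors the fact that their conditional marginal effects are not identified.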

To describe the limit of the estimator $\hat{\beta}$ in general, let $\mathcal{K}^* = \{k : \text{there is } \tilde{t} \text{ and } \bar{t} \text{ such that }
X_{\tilde{t}}^k = \tilde{x} \text{ and } X_{\bar{t}}^k = \bar{x}\}$. This is the set of possible values for $X_i$ where both $\tilde{x}$ and $\bar{x}$ occur for
at least one time period, allowing identification of the marginal effect from differences. For all
other values of $k$, either $\tilde{x}$ or $\bar{x}$ will be missing from the observations and the marginal effect
will not be identified. In the next Section we will consider bounds for those effects.

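Theorem 2: If equation (1) is satisfied, $(X_i, Y_i)$ have finite second moments and $\sum_{k \in \mathcal{K}^*} \mathcal{P}_k >
0$, then

$$\hat{\beta} \xrightarrow{p} \mu_I = \sum_{k \in \mathcal{K}^*} \mathcal{P}_k^* \mu_k,$$

where $\mathcal{P}_k^* = \mathcal{P}_k / \sum_{k \in \mathcal{K}^*} \mathcal{P}_k$.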

Here $\hat{\beta}$ is not an efficient estimator of $\mu_I$ for $T \geq 3$, because $\hat{\beta}$ is least squares over time,
which does not account properly for time series heteroskedasticity or autocorrelation. An efficient 
estimator could be obtained by a minimum distance procedure, though that is complicated. Also, 
one would have only few observations to estimate needed weighting matrices, so its properties 
may not be great in small to medium sized samples. For these reasons we leave construction of 
an efficient estimator to future work. 

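To see how big the inconsistency of the linear estimators can be we consider a numerical
example, where $X_{it} \in \{0, 1\}$ is i.i.d. across $i$ and $t$, $\Pr(X_{it} = 1) = p_X$, $\eta_{it}$ is i.i.d. $N(0, 1)$,

$$Y_{it} = 1(X_{it} + \alpha_i + \eta_{it} > 0), \quad \alpha_i = \sqrt{T}(\bar{X}_i - p_X)/\sqrt{p_X(1 - p_X)}, \quad \bar{X}_i = \sum_{t=1}^{T} X_{it}/T.$$

Here we consider the marginal effect for $\tilde{x} = 1$, $\bar{x} = 0$, $D = 1$, given by

$$\mu_0 = \int [\Phi(1 + \alpha) - \Phi(\alpha)]\, Q^*(d\alpha).$$

Table 1 and Figure 1 give numerical values for $(\beta_w - \mu_0)/\mu_0$ and $(\beta - \mu_0)/\mu_0$ for several values
of $T$ and $p_X$, where $\beta_w = \operatorname{plim} \hat{\beta}_w$ and $\beta = \operatorname{plim} \hat{\beta}$.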

We find that the biases (inconsistencies) can be large in percentage terms. We also find that 
biases are largest when $p_X$ is small. In this example, the inconsistency of fixed effects estimators
of marginal effects seems to be largest when the regressor values are sparse. Also we find that
differences between the limits of $\hat{\beta}$ and $\hat{\beta}_w$ are larger for larger $T$, which is to be expected due
to the weights differing more for larger $T$.
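
Because the individual effect in this design is a deterministic function of the number of periods with a regressor value of one, the three population quantities can be computed by summing over that count. The sketch below does so under the design stated above; the function names are ours, and it does not reproduce the paper's Table 1 or Figure 1.

```python
import math

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def limits(T, p):
    """Return (mu_0, limit of the within estimator, limit of the averaged
    individual-slope estimator) for the i.i.d. binary design with Pr(X=1)=p."""
    mu0 = num_w = den_w = num_b = den_b = 0.0
    for s in range(T + 1):  # s = number of periods with X_it = 1
        prob = math.comb(T, s) * p ** s * (1 - p) ** (T - s)
        alpha = math.sqrt(T) * (s / T - p) / math.sqrt(p * (1 - p))
        mu_s = Phi(1 + alpha) - Phi(alpha)  # conditional marginal effect
        r = s / T
        mu0 += prob * mu_s
        num_w += prob * r * (1 - r) * mu_s  # variance-weighted average
        den_w += prob * r * (1 - r)
        if 0 < s < T:                       # identified support points only
            num_b += prob * mu_s
            den_b += prob
    return mu0, num_w / den_w, num_b / den_b

for T in (2, 4, 8):
    mu0, beta_w, beta = limits(T, 0.1)
    print(T, (beta_w - mu0) / mu0, (beta - mu0) / mu0)
```

For T = 2 and T = 3 the two estimator limits coincide, since the variance weights are constant across non-constant sequences; they separate from T = 4 onward.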

3 Bounds in the Conditional Mean Model 

Although the marginal effect $\mu_0$ is not identified it is straightforward to bound it. Also, as we
will show below, these bounds can be quite informative, motivating the analysis that follows.
Some additional notation is useful for describing the results. Let

$$\bar{m}_t^k = E[Y_{it} \mid X_i = X^k]/D$$

be the identified conditional expectations of each time period observation on $Y_{it}$ conditional on
the $k$th support point. Also, let $\Delta(\alpha) = [m(\tilde{x}, \alpha) - m(\bar{x}, \alpha)]/D$. The next result gives identification
and bound results for $\mu_k$, which can then be used to obtain bounds for $\mu_0$.

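Lemma 3: Suppose that equation (1) is satisfied. If there is $\tilde{t}$ and $\bar{t}$ such that $X_{\tilde{t}}^k = \tilde{x}$ and
$X_{\bar{t}}^k = \bar{x}$ then

$$\mu_k = \bar{m}_{\tilde{t}}^k - \bar{m}_{\bar{t}}^k.$$

Suppose that $B_\ell \leq m(x, \alpha)/D \leq B_u$. If there is $\tilde{t}$ such that $X_{\tilde{t}}^k = \tilde{x}$ then

$$\bar{m}_{\tilde{t}}^k - B_u \leq \mu_k \leq \bar{m}_{\tilde{t}}^k - B_\ell.$$

Also, if there is $\bar{t}$ such that $X_{\bar{t}}^k = \bar{x}$ then

$$B_\ell - \bar{m}_{\bar{t}}^k \leq \mu_k \leq B_u - \bar{m}_{\bar{t}}^k.$$

Suppose that $\Delta(\alpha)$ has the same sign for all $\alpha$. Then if for some $k$ there is $\tilde{t}$ and $\bar{t}$ such that
$X_{\tilde{t}}^k = \tilde{x}$ and $X_{\bar{t}}^k = \bar{x}$, the sign of $\Delta(\alpha)$ is identified. Furthermore, if $\Delta(\alpha)$ is positive then the
lower bounds may be replaced by zero and if $\Delta(\alpha)$ is negative then the upper bounds may be
replaced by zero.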

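The bounds on each $\mu_k$ can be combined to obtain bounds for the marginal effect $\mu_0$. Let

$$\tilde{\mathcal{K}} = \{k : \text{there is } \tilde{t} \text{ such that } X_{\tilde{t}}^k = \tilde{x} \text{ but no } \bar{t} \text{ such that } X_{\bar{t}}^k = \bar{x}\},$$
$$\bar{\mathcal{K}} = \{k : \text{there is } \bar{t} \text{ such that } X_{\bar{t}}^k = \bar{x} \text{ but no } \tilde{t} \text{ such that } X_{\tilde{t}}^k = \tilde{x}\}.$$

Also, let $\mathcal{P}^0 = \Pr(X_i : X_{it} \neq \tilde{x} \text{ and } X_{it} \neq \bar{x}\ \forall t)$. The following result is obtained by
multiplying the $k$th bound in Lemma 3 by $\mathcal{P}_k$ and summing.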
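Theorem 4: If equation (1) is satisfied and $B_\ell \leq m(x, \alpha)/D \leq B_u$ then $\mu_\ell \leq \mu_0 \leq \mu_u$ for

$$\mu_\ell = \mathcal{P}^0 (B_\ell - B_u) + \sum_{k \in \tilde{\mathcal{K}}} \mathcal{P}_k (\bar{m}_{\tilde{t}}^k - B_u) + \sum_{k \in \bar{\mathcal{K}}} \mathcal{P}_k (B_\ell - \bar{m}_{\bar{t}}^k) + \sum_{k \in \mathcal{K}^*} \mathcal{P}_k \mu_k,$$
$$\mu_u = \mathcal{P}^0 (B_u - B_\ell) + \sum_{k \in \tilde{\mathcal{K}}} \mathcal{P}_k (\bar{m}_{\tilde{t}}^k - B_\ell) + \sum_{k \in \bar{\mathcal{K}}} \mathcal{P}_k (B_u - \bar{m}_{\bar{t}}^k) + \sum_{k \in \mathcal{K}^*} \mathcal{P}_k \mu_k.$$

If $\Delta(\alpha)$ has the same sign for all $\alpha$ and there is some $k^*$ such that $X_{\tilde{t}}^{k^*} = \tilde{x}$ and $X_{\bar{t}}^{k^*} = \bar{x}$, the
sign of $\mu_0$ is identified, and if $\mu_0 > 0$ $(< 0)$ then $\mu_\ell$ $(\mu_u)$ can be replaced by $\sum_{k \in \mathcal{K}^*} \mathcal{P}_k \mu_k$.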

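An estimator can be constructed by replacing the probabilities by sample proportions $\hat{\mathcal{P}}_k =
\sum_{i=1}^{n} 1(X_i = X^k)/n$ and $\hat{\mathcal{P}}^0 = 1 - \sum_{k \in \tilde{\mathcal{K}}} \hat{\mathcal{P}}_k - \sum_{k \in \bar{\mathcal{K}}} \hat{\mathcal{P}}_k - \sum_{k \in \mathcal{K}^*} \hat{\mathcal{P}}_k$, and each $\bar{m}_t^k$ by

$$\hat{m}_t^k = 1(n^k > 0) \sum_{i=1}^{n} 1(X_i = X^k) Y_{it}/(n^k D), \qquad n^k = \sum_{i=1}^{n} 1(X_i = X^k).$$

Estimators of the lower and upper bound respectively are given by

$$\hat{\mu}_\ell = \hat{\mathcal{P}}^0 (B_\ell - B_u) + \sum_{k \in \tilde{\mathcal{K}}} \hat{\mathcal{P}}_k (\hat{m}_{\tilde{t}}^k - B_u) + \sum_{k \in \bar{\mathcal{K}}} \hat{\mathcal{P}}_k (B_\ell - \hat{m}_{\bar{t}}^k) + (n^*/n)\hat{\beta},$$
$$\hat{\mu}_u = \hat{\mathcal{P}}^0 (B_u - B_\ell) + \sum_{k \in \tilde{\mathcal{K}}} \hat{\mathcal{P}}_k (\hat{m}_{\tilde{t}}^k - B_\ell) + \sum_{k \in \bar{\mathcal{K}}} \hat{\mathcal{P}}_k (B_u - \hat{m}_{\bar{t}}^k) + (n^*/n)\hat{\beta}.$$

The bounds $\hat{\mu}_\ell$ and $\hat{\mu}_u$ will be jointly asymptotically normal with variance matrix that can be
estimated in the usual way, so that set inference can be carried out as described in Chernozhukov,
Hong, and Tamer (2007), or Beresteanu and Molinari (2008).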

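As an example, consider the binary $X$ case where $X_{it} \in \{0, 1\}$, $\tilde{x} = 1$, and $\bar{x} = 0$. Let $X^K$
denote a $T \times 1$ unit vector and $X^1$ be the $T \times 1$ zero vector, assumed to lie in the support of
$X_i$. Here the bounds will be

$$\mu_\ell = \mathcal{P}_K (\bar{m}_{\tilde{t}}^K - B_u) + \mathcal{P}_1 (B_\ell - \bar{m}_{\bar{t}}^1) + \sum_{1<k<K} \mathcal{P}_k \mu_k, \tag{6}$$
$$\mu_u = \mathcal{P}_K (\bar{m}_{\tilde{t}}^K - B_\ell) + \mathcal{P}_1 (B_u - \bar{m}_{\bar{t}}^1) + \sum_{1<k<K} \mathcal{P}_k \mu_k.$$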
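
A sample analog of the binary-regressor bounds can be sketched as follows. This is a minimal illustration rather than the authors' procedure: the data layout and function name are assumptions, D is taken to be 1, and for individuals whose regressor never changes the outcome is averaged over all periods instead of using a single chosen period.

```python
def bounds_binary(X, Y, B_l=0.0, B_u=1.0):
    """Lower and upper bound estimates for binary X (x_tilde=1, x_bar=0, D=1).

    X, Y: lists of per-individual sequences of equal length T.
    """
    n = len(X)
    lower = upper = 0.0
    for x_i, y_i in zip(X, Y):
        T = len(x_i)
        ones = sum(x_i)
        if ones == T:        # regressor always one: only m(1, a) is observed
            y_mean = sum(y_i) / T
            lower += y_mean - B_u
            upper += y_mean - B_l
        elif ones == 0:      # regressor always zero: only m(0, a) is observed
            y_mean = sum(y_i) / T
            lower += B_l - y_mean
            upper += B_u - y_mean
        else:                # both values occur: effect is identified
            y1 = sum(y for x, y in zip(x_i, y_i) if x == 1) / ones
            y0 = sum(y for x, y in zip(x_i, y_i) if x == 0) / (T - ones)
            lower += y1 - y0
            upper += y1 - y0
    return lower / n, upper / n

# Same hypothetical four-person panel with T = 2.
X = [[0, 1], [1, 1], [1, 0], [0, 0]]
Y = [[0, 1], [1, 1], [1, 1], [0, 0]]
print(bounds_binary(X, Y))
```

The width of the interval equals (B_u - B_l) times the sample share of individuals with a constant regressor, which is why the bounds tighten as fewer individuals are stayers.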

It is interesting to ask how the bounds behave as $T$ grows. If the bounds converge to $\mu_0$ as
$T$ goes to infinity then $\mu_0$ is identified for infinite $T$. If the bounds converge rapidly as $T$ grows
then one might hope to obtain tight bounds for $T$ not very large. The following result gives a
simple condition under which the bounds converge to $\mu_0$ as $T$ grows.

Theorem 5: If equation (1) is satisfied, $B_\ell \leq m(x, \alpha)/D \leq B_u$, $X_i = (X_{i1}, X_{i2}, \ldots)$ is
stationary and, conditional on $\alpha_i$, the support of each $X_{it}$ is the marginal support of $X_{it}$ and
$X_i$ is ergodic, then $\mu_\ell \to \mu_0$ and $\mu_u \to \mu_0$ as $T \to \infty$.






This result gives conditions for identification as $T$ grows, generalizing a result of Chamberlain
(1982) for binary $X_{it}$. In addition, it shows that the bounds derived above shrink to the marginal
effect as $T$ grows. The rate at which the bounds converge in the general model is a complicated
question. Here we will address it in an example and leave general treatment to another setting.
The example we consider is that where $X_{it} \in \{0, 1\}$.

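Theorem 6: If equation (1) is satisfied, $B_\ell \leq m(x, \alpha)/D \leq B_u$ and $X_i$ is stationary and
Markov of order $J$ conditional on $\alpha_i$, then for $p_i^1 = \Pr(X_{it} = 0 \mid X_{i,t-1} = \cdots = X_{i,t-J} = 0, \alpha_i)$
and $p_i^K = \Pr(X_{it} = 1 \mid X_{i,t-1} = \cdots = X_{i,t-J} = 1, \alpha_i)$

$$\max\{|\mu_\ell - \mu_0|, |\mu_u - \mu_0|\} \leq (B_u - B_\ell) E[(p_i^1)^{T-J} + (p_i^K)^{T-J}].$$

If there is $\epsilon > 0$ such that $p_i^1 \leq 1 - \epsilon$ and $p_i^K \leq 1 - \epsilon$ then

$$\max\{|\mu_\ell - \mu_0|, |\mu_u - \mu_0|\} \leq (B_u - B_\ell)\, 2 (1 - \epsilon)^{T-J}.$$

If there is a set $A$ of $\alpha_i$ such that $\Pr(A) > 0$ and, either $\Pr(X_{i1} = \cdots = X_{iJ} = 0 \mid \alpha_i) > 0$ for
$\alpha_i \in A$ and $p_i^1 = 1$ for all $\alpha_i \in A$, or $\Pr(X_{i1} = \cdots = X_{iJ} = 1 \mid \alpha_i) > 0$ for $\alpha_i \in A$ and
$p_i^K = 1$, then $\mu_\ell \nrightarrow \mu_0$ or $\mu_u \nrightarrow \mu_0$.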

When the conditional probabilities that $X_{it}$ is zero or one are bounded away from one the
bounds will converge at an exponential rate. We conjecture that an analogous result could be
shown for general $X_{it}$. The conditions that imply that one of the bounds does not converge
violate a hypothesis of Theorem 5, that the conditional support of $X_{it}$ equals the marginal
support. Theorem 6 shows that in this case the bounds may not shrink to the marginal effect.
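
To see the exponential rate in a concrete case, the sketch below takes a binary regressor that follows a stationary first-order Markov chain with the individual effect held fixed, so that J = 1 and the expectation over the individual effect is trivial. It compares the exact width of the binary-regressor bounds, which is (B_u - B_l) times the probability of a constant sequence, with a first-order version of the Theorem 6 bound using exponent T - 1. The parameter names and the single-type simplification are assumptions made for illustration.

```python
def constant_sequence_probs(p_stay0, p_stay1, T):
    """Pr(all zeros) and Pr(all ones) over T periods for a stationary
    two-state Markov chain with staying probabilities p_stay0 and p_stay1."""
    leave0, leave1 = 1.0 - p_stay0, 1.0 - p_stay1
    pi0 = leave1 / (leave0 + leave1)  # stationary probability of state 0
    pi1 = 1.0 - pi0
    return pi0 * p_stay0 ** (T - 1), pi1 * p_stay1 ** (T - 1)

def width_and_rate_bound(p_stay0, p_stay1, T, B_l=0.0, B_u=1.0):
    """Exact width of the bounds and the first-order Markov rate bound."""
    P1, PK = constant_sequence_probs(p_stay0, p_stay1, T)
    width = (B_u - B_l) * (P1 + PK)
    rate_bound = (B_u - B_l) * (p_stay0 ** (T - 1) + p_stay1 ** (T - 1))
    return width, rate_bound

for T in (2, 4, 8, 16):
    print(T, width_and_rate_bound(0.7, 0.6, T))
```

Both columns shrink geometrically in T; the exact width is smaller because it also multiplies by the stationary probability of starting in each state.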

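The bounds may converge, but not exponentially fast, depending on $P(\alpha_i)$ and the distribution
of $\alpha_i$. For example, suppose that $X_{it} = 1(\alpha_i - \varepsilon_{it} > 0)$, $\alpha_i \sim N(0, 1)$, $\varepsilon_{it} \sim N(0, 1)$, with
$\alpha_i$ i.i.d. over $i$, and $\varepsilon_{it}$ i.i.d. over $t$ and independent of $\alpha_i$. Then

$$\mathcal{P}_K = E[\Phi(\alpha_i)^T] = \int \Phi(\alpha)^T \phi(\alpha)\, d\alpha = \left. \frac{\Phi(\alpha)^{T+1}}{T+1} \right|_{-\infty}^{+\infty} = \frac{1}{T+1}.$$

In this example the bounds will converge at the slow rate $1/T$. More generally, the convergence
rate will depend on the distribution of $p_i^1$ and $p_i^K$.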
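
The closed form in this example is easy to confirm numerically. The following sketch approximates the expectation of the T-th power of the standard normal CDF, evaluated at a standard normal draw, with a trapezoid rule; the integration limits and step count are arbitrary choices made for illustration.

```python
import math

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def prob_all_ones(T, lo=-10.0, hi=10.0, steps=20000):
    """Trapezoid approximation to the integral of Phi(a)**T * phi(a) da."""
    h = (hi - lo) / steps
    total = 0.0
    for j in range(steps + 1):
        a = lo + j * h
        weight = 0.5 if j in (0, steps) else 1.0
        total += weight * Phi(a) ** T * phi(a)
    return total * h

for T in (1, 2, 5, 10):
    print(T, prob_all_ones(T), 1.0 / (T + 1))
```

The probability of an all-ones sequence therefore falls only at rate 1/T, in contrast to the geometric decline obtained when the staying probabilities are bounded away from one.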

It is interesting to note that the convergence rates we have derived so far depend only on
the properties of the joint distribution of $(X_i, \alpha_i)$, and not on the properties of the conditional
distribution of $Y_i$ given $(X_i, \alpha_i)$. This feature of the problem is consistent with us placing no
restrictions on $m(x, \alpha)$. In Section 5 we find that the bounds and rates may be improved when
the conditional distribution of $Y_i$ given $(X_i, \alpha_i)$ is restricted.
4 Predetermined Regressors



The previous bound analysis can be extended to cases where the regressor $X_{it}$ is just predetermined
instead of strictly exogenous. These cases cover, for example, dynamic panel models
where $X_{it}$ includes lags of $Y_{it}$. To describe this extension let $X_i(t) = [X_{i1}, \ldots, X_{it}]'$ and suppose
that

$$E[Y_{it} \mid X_i(t), \alpha_i] = m(X_{it}, \alpha_i), \quad (t = 1, \ldots, T). \tag{7}$$

For example, this includes the heterogeneous, dynamic binary choice model of Browning and
Carro (2007), where $Y_{it} \in \{0, 1\}$ and $X_{it} = Y_{i,t-1}$.

As before, the marginal effect is given by $\mu_0 = \int [m(\tilde{x}, \alpha) - m(\bar{x}, \alpha)]\, Q^*(d\alpha)/D$ for two different
possible values $\tilde{x}$ and $\bar{x}$ of the regressors and a distance $D$. Also, as before, the marginal effect
will have an identified component and an unidentified component. The key implication that is
used to obtain the identified component is

$$E[Y_{it} \mid X_i(t) = X(t)] = \int m(X_t, \alpha)\, Q^*(d\alpha \mid X_i(t) = X(t)), \tag{8}$$

where $X(t) = [X_1, \ldots, X_t]'$.

Bounds are obtained by partitioning the set of possible $X_i$ into subsets that can make
use of the above key implication and a subset where bounds on $m(x, \alpha)/D$ are applied. The
key implication applies to subsets of the form $\mathcal{X}^t(x) = \{X : X_t = x, X_s \neq x\ \forall s < t\}$,
that is a set of possible $X_i$ vectors that have $x$ as the $t$th component and not as any previous
components. The bound applies to the same subset as before, that where $x$ never appears, given
by $\bar{\mathcal{X}}(x) = \{X : X_t \neq x\ \forall t\}$. Together the union of $\mathcal{X}^t(x)$ over all $t$ and $\bar{\mathcal{X}}(x)$ constitute a
partition of possible $X$ vectors. Let $\bar{\mathcal{P}}(x) = \Pr(X_i \in \bar{\mathcal{X}}(x))$ be the probability that none of the
components of $X_i$ is equal to $x$ and

$$\delta_0 = E\left[ \sum_{t=1}^{T} \{1(X_i \in \mathcal{X}^t(\tilde{x})) - 1(X_i \in \mathcal{X}^t(\bar{x}))\} Y_{it} \right] \Big/ D.$$

Then the key implication and iterated expectations give

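Theorem 7: If equation (7) is satisfied and $B_\ell \leq m(x, \alpha)/D \leq B_u$ then $\mu_\ell \leq \mu_0 \leq \mu_u$ for

$$\mu_\ell = \delta_0 + B_\ell \bar{\mathcal{P}}(\tilde{x}) - B_u \bar{\mathcal{P}}(\bar{x}), \qquad \mu_u = \delta_0 + B_u \bar{\mathcal{P}}(\tilde{x}) - B_\ell \bar{\mathcal{P}}(\bar{x}). \tag{9}$$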

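As previously, estimates of these bounds can be formed from sample analogs. Let $\hat{\mathcal{P}}(x) =
\sum_{i=1}^{n} 1(X_i \in \bar{\mathcal{X}}(x))/n$ and

$$\hat{\delta} = \sum_{i=1}^{n} \sum_{t=1}^{T} [1(X_i \in \mathcal{X}^t(\tilde{x})) - 1(X_i \in \mathcal{X}^t(\bar{x}))] Y_{it}/(nD).$$

The estimates of the bounds are given by

$$\hat{\mu}_\ell = \hat{\delta} + B_\ell \hat{\mathcal{P}}(\tilde{x}) - B_u \hat{\mathcal{P}}(\bar{x}), \qquad \hat{\mu}_u = \hat{\delta} + B_u \hat{\mathcal{P}}(\tilde{x}) - B_\ell \hat{\mathcal{P}}(\bar{x}).$$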

Inference using these bounds can be carried out analogously to the strictly exogenous case. 

An important example is binary $Y_{it} \in \{0, 1\}$ where $X_{it} = Y_{i,t-1}$. Here $B_u = 1$ and $B_\ell = 0$,
so the marginal effect is

$$\mu_0 = \int [\Pr(Y_{it} = 1 \mid Y_{i,t-1} = 1, \alpha) - \Pr(Y_{it} = 1 \mid Y_{i,t-1} = 0, \alpha)]\, Q^*(d\alpha),$$

i.e., the effect of the lagged $Y_{i,t-1}$ on the probability that $Y_{it} = 1$, holding $\alpha_i$ constant, averaged
over $\alpha_i$. In this sense the bounds provide an approximate solution to the problem considered
by Feller (1943) and Heckman (1981) of evaluating duration dependence in the presence of
unobserved heterogeneity. In this example the bounds estimates are

$$\hat{\mu}_\ell = \hat{\delta} - \hat{\mathcal{P}}(0), \qquad \hat{\mu}_u = \hat{\delta} + \hat{\mathcal{P}}(1). \tag{10}$$

The width of the bounds is $\hat{\mathcal{P}}(0) + \hat{\mathcal{P}}(1)$, so although these bounds may not be very informative
in short panels, in long panels, where $\hat{\mathcal{P}}(0) + \hat{\mathcal{P}}(1)$ is small, they will be.
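
The bounds estimates for the lagged-outcome example can be computed directly from the observed paths. The sketch below is illustrative only: the input layout (each path includes the initial observation, so the regressor in period t is the previous outcome) and the function name are assumptions.

```python
def dynamic_bounds(paths):
    """Bounds estimates for the effect of the lagged outcome
    (x_tilde = 1, x_bar = 0, D = 1, B_l = 0, B_u = 1).

    paths: list of sequences [Y_0, Y_1, ..., Y_T] of zeros and ones.
    """
    n = len(paths)
    delta = 0.0
    never = {0: 0, 1: 0}  # counts of paths whose regressor never equals 0 / 1
    for path in paths:
        X = list(path[:-1])  # X_t = Y_{t-1}
        Y = list(path[1:])
        for value, sign in ((1, 1.0), (0, -1.0)):
            if value in X:
                first = X.index(value)  # first period with X_t == value
                delta += sign * Y[first]
            else:
                never[value] += 1
    delta /= n
    return delta - never[0] / n, delta + never[1] / n

# Four hypothetical paths with T = 2 plus an initial observation.
paths = [[0, 1, 1], [1, 1, 1], [0, 0, 0], [1, 0, 1]]
print(dynamic_bounds(paths))
```

Only the outcome at the first period in which each regressor value appears enters the identified component, in line with the partition used for predetermined regressors.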

Theorems 5 and 6 on convergence of the bounds as $T$ grows apply to $\mu_\ell$ and $\mu_u$ from equation
(9), since the bounds have a similar structure and the convergence results explicitly allow for
dependence over time of $X_{it}$ conditional on $\alpha_i$. For example, for $Y_{it} \in \{0, 1\}$ and $X_{it} = Y_{i,t-1}$,
equation (7) implies that $Y_{it}$ is Markov conditional on $\alpha_i$ with $J = 1$. Theorem 5 then shows
that the bounds converge to the marginal effect as $T$ grows if $0 < \Pr(Y_{it} = 1 \mid \alpha_i) < 1$ with
probability one. Theorem 6 also gives the rate at which the bounds converge, e.g., the rate will be
exponential if $\Pr(Y_{it} = 1 \mid Y_{i,t-1} = 1, \alpha_i)$ and $\Pr(Y_{it} = 0 \mid Y_{i,t-1} = 0, \alpha_i)$ are bounded away from
one.

It appears that, unlike the strictly exogenous case, there is only one way to estimate the
identified component $\delta_0$. In this sense the estimators given here for the bounds should be
asymptotically efficient, so there should be no gain in trying to account for heteroskedasticity
and autocorrelation over time. Also, it does not appear possible to obtain tighter bounds when
monotonicity holds, because the partition is different for $\tilde{x}$ and $\bar{x}$.

5 Semiparametric Multinomial Choice 

The bounds for marginal effects derived in the previous sections did not use any functional
form restrictions on the conditional distribution of $Y_i$ given $(X_i, \alpha_i)$. If this distribution is
restricted one may be able to tighten the bounds. To illustrate we consider a semiparametric
multinomial choice model where the conditional distribution of $Y_i$ given $(X_i, \alpha_i)$ is specified and
the conditional distribution of $\alpha_i$ given $X_i$ is unknown.

We assume that the vector $Y_i$ of outcome variables can take $J$ possible values $Y^1, \ldots, Y^J$.
As before, we also assume that $X_i$ has a discrete distribution and can take $K$ possible values
$X^1, \ldots, X^K$. Suppose that the conditional probability of $Y_i$ given $(X_i, \alpha_i)$ is

$$\Pr(Y_i = Y^j \mid X_i = X^k, \alpha_i) = \mathcal{L}(Y^j \mid X^k, \alpha_i, \beta^*)$$

for some finite dimensional $\beta^*$ and some known function $\mathcal{L}$. Let $Q_k^*$ denote the unknown conditional
distribution of $\alpha_i$ given $X_i = X^k$. Let $\mathcal{P}_{jk}$ denote the conditional probability of $Y_i = Y^j$
given $X_i = X^k$. We then have

$$\mathcal{P}_{jk} = \int \mathcal{L}(Y^j \mid X^k, \alpha, \beta^*)\, Q_k^*(d\alpha), \quad (j = 1, \ldots, J;\ k = 1, \ldots, K), \tag{11}$$

where $\mathcal{P}_{jk}$ is identified from the data and the right hand side are the probabilities predicted
by the model. This model is semiparametric in having a likelihood $\mathcal{L}$ that is parametric and
conditional distributions $Q_k^*$ for the individual effect that are completely unspecified. In general
the parameters of the model may be set identified, so the previous equation is satisfied by a set
of values $B$ that includes $\beta^*$ and a set of distributions for $Q_k$ that includes $Q_k^*$ for $k = 1, \ldots, K$.
We discuss identification of model parameters in more detail in the next section. Here we will focus
on bounds for the marginal effect when this model holds.

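For example, consider a binary choice model where $Y_{it} \in \{0, 1\}$, $Y_{i1}, \ldots, Y_{iT}$ are independent
conditional on $(X_i, \alpha_i)$, and

$$\Pr(Y_{it} = 1 \mid X_i, \alpha_i, \beta) = F(X_{it}'\beta + \alpha_i) \tag{12}$$

for a known CDF $F(\cdot)$. Then each $Y^j$ consists of a $T \times 1$ vector of zeros and ones, so there are
$J = 2^T$ possible values. Also,

$$\mathcal{L}(Y_i \mid X_i, \alpha_i, \beta) = \prod_{t=1}^{T} F(X_{it}'\beta + \alpha_i)^{Y_{it}} [1 - F(X_{it}'\beta + \alpha_i)]^{1 - Y_{it}}.$$

The observed conditional probabilities then satisfy

$$\mathcal{P}_{jk} = \int \left\{ \prod_{t=1}^{T} F(X_t^{k\prime}\beta^* + \alpha)^{Y_t^j} [1 - F(X_t^{k\prime}\beta^* + \alpha)]^{1 - Y_t^j} \right\} Q_k^*(d\alpha), \quad (j = 1, \ldots, 2^T;\ k = 1, \ldots, K).$$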
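
The mapping from a candidate distribution of the individual effect to the predicted outcome probabilities can be sketched as follows. The logistic choice of F, the scalar regressor, and the two-point mixing distribution are assumptions made purely for illustration; the function names are ours.

```python
import math
from itertools import product

def logistic_cdf(z):
    """Logistic CDF, an assumed choice for the known CDF F."""
    return 1.0 / (1.0 + math.exp(-z))

def likelihood(y, x, alpha, beta, F=logistic_cdf):
    """Probability of the outcome sequence y given regressor sequence x,
    individual effect alpha and slope beta, under conditional independence."""
    out = 1.0
    for y_t, x_t in zip(y, x):
        p = F(x_t * beta + alpha)
        out *= p if y_t == 1 else 1.0 - p
    return out

def predicted_probs(x, alphas, weights, beta):
    """Predicted probability of each of the 2**T outcome sequences when the
    individual effect has a discrete distribution (alphas, weights)."""
    T = len(x)
    return {
        y: sum(w * likelihood(y, x, a, beta) for a, w in zip(alphas, weights))
        for y in product((0, 1), repeat=T)
    }

# Hypothetical regressor sequence and two-point mixing distribution.
probs = predicted_probs((0, 1, 1), alphas=[-1.0, 0.5], weights=[0.4, 0.6], beta=0.8)
print(len(probs), sum(probs.values()))
```

Each candidate pair of slope and mixing distribution produces one such vector of 2^T probabilities; the model restrictions require it to match the identified conditional probabilities for every support point of the regressors.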

As discussed above, for the binary choice model the marginal effect of a change in $X_{it}$ from
$\bar{x}$ to $\tilde{x}$, conditional on $X_i = X^k$, is

$$\mu_k = D^{-1} \int [F(\tilde{x}'\beta^* + \alpha) - F(\bar{x}'\beta^* + \alpha)]\, Q_k^*(d\alpha), \tag{13}$$

for a distance $D$. This marginal effect is generally not identified. Bounds can be constructed
using the results of Section 3 with $B_\ell = 0$ and $B_u = 1$, since $m(x, \alpha) = F(x'\beta^* + \alpha) \in [0, 1]$.
Moreover, in this model the sign of $\Delta(\alpha) = D^{-1}[F(\tilde{x}'\beta^* + \alpha) - F(\bar{x}'\beta^* + \alpha)]$ does not change
with $\alpha$, so we can apply the result in Lemma 3 to reduce the size of the bounds. These bounds,
however, are not tight because they do not fully exploit the structure of the model. Sharper
bounds are given by

$$\underline{\mu}_k = \min_{\beta \in B,\, Q_k} D^{-1} \int [F(\tilde{x}'\beta + \alpha) - F(\bar{x}'\beta + \alpha)]\, Q_k(d\alpha) \quad \text{s.t.} \quad \mathcal{P}_{jk} = \int \mathcal{L}(Y^j \mid X^k, \alpha, \beta)\, Q_k(d\alpha)\ \ \forall j,$$

and

$$\overline{\mu}_k = \max_{\beta \in B,\, Q_k} D^{-1} \int [F(\tilde{x}'\beta + \alpha) - F(\bar{x}'\beta + \alpha)]\, Q_k(d\alpha) \quad \text{s.t.} \quad \mathcal{P}_{jk} = \int \mathcal{L}(Y^j \mid X^k, \alpha, \beta)\, Q_k(d\alpha)\ \ \forall j.$$

In the next sections we will discuss how these bounds can be computed and estimated. Here we
will consider how fast the bounds shrink as $T$ grows.

First, note that since this model is a special case of (more restricted than) the conditional 
mean model, the bounds here will be sharper than the bounds previously given. Therefore, the 
bounds here will converge at least as fast as the previous bounds. Imposing the structure here 
does improve convergence rates. In some cases one can obtain fast rates without any restrictions
on the joint distribution of $X_i$ and $\alpha_i$.

We will consider carefully the logit model and leave other models to future work. The logit
model is simpler than others because $\beta^*$ is point identified. In other cases one would need to
account for the bounds for $\beta^*$. To keep the notation simple we focus on the binary $X$ case,
$X_{it} \in \{0, 1\}$, where $\tilde{x} = 1$ and $\bar{x} = 0$. We find that the bounds shrink at rate $T^{-r}$ for any finite
$r$, without any restriction on the joint distribution of $X_i$ and $\alpha_i$.

Theorem 8: For $k = 1$ or $k = K$ and for any $r > 0$, as $T \to \infty$,

$$\overline{\mu}_k - \underline{\mu}_k = O(T^{-r}).$$

Fixed effects maximum likelihood estimators (FEMLEs) are a common approach to estimate
model parameters and marginal effects in multinomial choice panel models. Here we compare
the probability limit of these estimators to the true value of the corresponding parameters. The
FEMLE treats the realizations of the individual effects as parameters to be estimated. The
corresponding population problem can be expressed as

$$\hat{\beta} = \arg\max_{\beta} \sum_{k=1}^{K} \mathcal{P}_k \sum_{j=1}^{J} \mathcal{P}_{jk} \log \mathcal{L}(Y^j \mid X^k, \hat{\alpha}_{jk}(\beta), \beta), \tag{16}$$

where

$$\hat{\alpha}_{jk}(\beta) = \arg\max_{\alpha} \log \mathcal{L}(Y^j \mid X^k, \alpha, \beta), \quad \forall j, k. \tag{17}$$

Here, we first concentrate out the support points of the conditional distributions of $\alpha$ and then
solve for the parameter $\beta$.

Fixed effects estimation therefore imposes that the estimate of $Q_k$ has no more than $J$ points
of support. The distributions implicitly estimated by FE take the form

$$\hat{Q}_{k\beta}(\alpha) = \begin{cases} \mathcal{P}_{jk}, & \text{for } \alpha = \hat{\alpha}_{jk}(\beta); \\ 0, & \text{otherwise.} \end{cases} \tag{18}$$



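The following example illustrates this point using a simple two-period model. Consider a
two-period binary choice model with binary regressor and strictly increasing and symmetric CDF,
i.e., $F(-x) = 1 - F(x)$. In this case the estimands of the fixed effects estimators are

$$\hat{\alpha}_{jk}(\beta) = \begin{cases} -\infty, & \text{if } Y^j = (0, 0); \\ -\beta(X_1^k + X_2^k)/2, & \text{if } Y^j = (1, 0) \text{ or } Y^j = (0, 1); \\ \infty, & \text{if } Y^j = (1, 1), \end{cases} \tag{19}$$

and the corresponding distribution for $\alpha$ has the form

$$\hat{Q}_{k\beta}(\alpha) = \begin{cases} \Pr\{Y = (0, 0) \mid X^k\}, & \text{if } \alpha = -\infty; \\ \Pr\{Y = (1, 0) \mid X^k\} + \Pr\{Y = (0, 1) \mid X^k\}, & \text{if } \alpha = -\beta(X_1^k + X_2^k)/2; \\ \Pr\{Y = (1, 1) \mid X^k\}, & \text{if } \alpha = \infty. \end{cases} \tag{20}$$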
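
The concentration step in (19) can be checked numerically for the logit case. The sketch below is an illustration under stated assumptions, not the authors' code: it takes a logistic $F$, a scalar $\beta^*$, the cell $X^k = (0, 1)$, and a hypothetical discrete distribution for $\alpha$, then maximizes the concentrated population objective. The cells $(0,0)$ and $(1,1)$ do not depend on $\beta$ after concentration and the cell $(1,0)$ mirrors $(0,1)$, so one cell is enough to locate the maximizer.

```python
import math

def F(z):
    """Logistic CDF (symmetric: F(-z) = 1 - F(z))."""
    return 1.0 / (1.0 + math.exp(-z))

def switcher_probs(beta_star, support, weights):
    """Pr{Y=(0,1)} and Pr{Y=(1,0)} given X^k = (0, 1), mixing over alpha."""
    p01 = sum(w * (1 - F(a)) * F(beta_star + a) for a, w in zip(support, weights))
    p10 = sum(w * F(a) * (1 - F(beta_star + a)) for a, w in zip(support, weights))
    return p01, p10

def fe_estimand(p01, p10, lo=-50.0, hi=50.0, tol=1e-12):
    """Maximize the concentrated objective for the cell X^k = (0, 1).

    With alpha set to -beta/2 as in (19), the objective is
    2 * [p01 * log F(beta/2) + p10 * log F(-beta/2)]. Its derivative,
    p01 * (1 - F(beta/2)) - p10 * F(beta/2), is decreasing in beta,
    so the maximizer can be located by bisection on the score.
    """
    def score(b):
        return p01 * (1 - F(b / 2)) - p10 * F(b / 2)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical design: beta* = 0.8 and a three-point distribution for alpha.
beta_star = 0.8
p01, p10 = switcher_probs(beta_star, support=(-1.0, 0.0, 1.5),
                          weights=(0.2, 0.5, 0.3))
beta_fe = fe_estimand(p01, p10)
print(beta_fe, 2 * beta_star)
```

The maximizer coincides with $2\beta^*$ up to the bisection tolerance, in line with the two-period logit result attributed to Andersen (1973) later in this section.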



This formulation of the problem is convenient to analyze the properties of nonlinear fixed
effects estimators of marginal effects. Thus, for example, the estimator of the marginal effect $\mu_k$
takes the form:

$$\hat{\mu}_k(\beta) = D^{-1} \int [F(\tilde{x}'\beta + \alpha) - F(\bar{x}'\beta + \alpha)]\, \hat{Q}_{k\beta}(d\alpha). \tag{21}$$

The average of these estimates across individuals with identified effects is consistent for the
identified effect $\mu_I$ when $X$ is binary. This result is shown here analytically for the two-period
case and through numerical examples for $T \ge 3$.



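Theorem 9: If $F'(x) > 0$, $F(-x) = 1 - F(x)$, and $\sum_{k=2}^{K-1} \mathcal{P}_k > 0$, then, for the average of the
estimated effects over the cells with identified effects,

$$\frac{\sum_{k=2}^{K-1} \mathcal{P}_k\, \hat{\mu}_k(\hat{\beta})}{\sum_{k=2}^{K-1} \mathcal{P}_k} = \mu_I.$$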



For not identified effects the nonlinear fixed effects estimators are usually biased toward zero,
introducing bias of the same direction in the fixed effects estimator of the average effect if
there are individuals with not identified effects in the population. To see this, consider a logit
model with binary regressor, $X^k = (0, 0)$, $\bar{x} = 0$ and $\tilde{x} = 1$. Using that $\hat{\beta} = 2\beta^*$ (Andersen,
1973) and $F'(x) = F(x)(1 - F(x)) \le 1/4$, we have

$$\hat{\mu}_k(\hat{\beta}) = [F(\hat{\beta}) - F(0)]\, [P\{Y = (1, 0) \mid X^k\} + P\{Y = (0, 1) \mid X^k\}] \le \frac{\hat{\beta}}{2} \int F(\alpha)[1 - F(\alpha)]\, Q_k^*(d\alpha) = E[\beta^* F'(\bar{x}'\beta^* + \alpha) \mid X = X^k].$$

This conjecture is further explored numerically in the next section.
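
The steps behind the inequality above can be spelled out as follows. This is a sketch under the assumptions stated in the text (logit $F$, $X^k = (0, 0)$, $\bar{x} = 0$, $\tilde{x} = 1$, so $D = 1$), with the additional assumption $\beta^* \ge 0$ so that the mean value bound has the stated direction. By (20), the implied distribution puts mass at $\alpha = 0$ and at $\pm\infty$, and the infinite points contribute nothing to the effect.

```latex
\begin{align*}
\hat{\mu}_k(\hat{\beta})
  &= [F(\hat{\beta}) - F(0)]\,
     \bigl[P\{Y=(1,0)\mid X^k\} + P\{Y=(0,1)\mid X^k\}\bigr]
     && \text{(mass at } \alpha = 0 \text{ in (20))} \\
  &= [F(\hat{\beta}) - F(0)] \cdot 2\int F(\alpha)[1 - F(\alpha)]\,Q_k^*(d\alpha)
     && \text{(both switch sequences have probability } F(1-F)) \\
  &\le \frac{\hat{\beta}}{4} \cdot 2\int F(\alpha)[1 - F(\alpha)]\,Q_k^*(d\alpha)
     && (F' \le 1/4) \\
  &= \beta^* \int F'(\alpha)\,Q_k^*(d\alpha)
     && (\hat{\beta} = 2\beta^*,\ F' = F(1 - F)).
\end{align*}
```

The last line is $E[\beta^* F'(\bar{x}'\beta^* + \alpha) \mid X = X^k]$ evaluated at $\bar{x} = 0$.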



6 Characterization and Computation of Population Bounds 

6.1 Identification Sets and Extremal Distributions 

We will begin our discussion of calculating bounds by considering bounds for the parameter
$\beta$. Let $\mathcal{L}_{jk}(\beta, Q_k) := \int \mathcal{L}(Y^j \mid X^k, \alpha, \beta)\, Q_k(d\alpha)$ and $Q := (Q_1, \ldots, Q_K)$. For the subsequent
inferential analysis, it is convenient to introduce a quadratic loss function

$$T(\beta, Q; \mathcal{P}) = \sum_{j,k} \omega_{jk}(\mathcal{P})\, [\mathcal{P}_{jk} - \mathcal{L}_{jk}(\beta, Q_k)]^2, \tag{22}$$

where $\omega_{jk}(\mathcal{P})$ are positive weights. By the definition of the model in (11), we can see that
$(\beta^*, Q^*)$ is such that

$$T(\beta, Q; \mathcal{P}) \ge T(\beta^*, Q^*; \mathcal{P}) = 0,$$

for every $(\beta, Q)$. For $T(\beta; \mathcal{P}) := \inf_Q T(\beta, Q; \mathcal{P})$, this implies that

$$T(\beta; \mathcal{P}) \ge T(\beta^*; \mathcal{P}) = 0,$$

for every $\beta$. Let $B$ be the set of $\beta$'s that minimize $T(\beta; \mathcal{P})$, i.e.,

$$B := \{\beta : T(\beta; \mathcal{P}) = 0\}.$$

Then we can see that $\beta^* \in B$. In other words, $\beta^*$ is set identified by the set $B$.

It follows from the following lemma that one needs only to search over discrete distributions
for $Q$ to find $B$.

Lemma 10: If the support $\mathcal{C}$ of $\alpha_i$ is compact and $\mathcal{L}(Y^j \mid X^k, \alpha, \beta)$ is continuous in $\alpha$ for
each $\beta$, $j$, and $k$, then, for each $\beta \in B$ and $k$, a solution to

$$Q_{k\beta} = \arg\min_{Q_k} \sum_{j=1}^{J} \omega_{jk}(\mathcal{P})\, [\mathcal{P}_{jk} - \mathcal{L}_{jk}(\beta, Q_k)]^2$$

exists that is a discrete distribution with at most $J$ points of support, and $\mathcal{L}_{jk}(\beta, Q_{k\beta}) = \mathcal{P}_{jk}$,
$\forall j, k$.

Another important result is that the bounds for marginal effects can also be found by searching
over discrete distributions with few points of support. We will focus on the upper bound $\overline{\mu}_k$
defined in (15); an analogous result holds for the lower bound $\underline{\mu}_k$ in (14).

Lemma 11: If the support $\mathcal{C}$ of $\alpha_i$ is compact and $\mathcal{L}(Y^j \mid X^k, \alpha, \beta)$ is continuous in $\alpha$ for
each $\beta$, $j$, and $k$, then, for each $\beta \in B$ and $k$, a solution to

$$Q_{k\beta} = \arg\max_{Q_k} D^{-1} \int [F(\tilde{x}'\beta + \alpha) - F(\bar{x}'\beta + \alpha)]\, Q_k(d\alpha) \quad \text{s.t.} \quad \mathcal{L}_{jk}(\beta, Q_k) = \mathcal{P}_{jk},\ \forall j,$$

can be obtained from a discrete distribution with at most $J$ points of support.
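
Lemma 11 suggests a brute-force way to see the bounds in a toy case. The sketch below is not the authors' algorithm: it assumes a logit model with $T = 2$, the cell $X^k = (0, 0)$, a known $\beta$, $D = 1$, and a hypothetical grid of support points. In this cell the four choice probabilities depend on $Q_k$ only through $E[F(\alpha)]$ and $E[F(\alpha)^2]$, so there are three linear constraints (including total mass), and every vertex of the feasible set has at most three support points. Enumerating all triples of grid points therefore recovers the minimum and maximum of the linear program restricted to the grid.

```python
import math
from itertools import combinations

def F(z):
    """Logistic CDF."""
    return 1.0 / (1.0 + math.exp(-z))

def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def solve3(A, b):
    """Solve A w = b by Cramer's rule; return None if A is near singular."""
    d = det3(A)
    if abs(d) < 1e-12:
        return None
    sol = []
    for c in range(3):
        Ac = [row[:] for row in A]
        for r in range(3):
            Ac[r][c] = b[r]
        sol.append(det3(Ac) / d)
    return sol

def effect_bounds(beta, grid, m1, m2):
    """Min and max of sum_m pi_m [F(beta + a_m) - F(a_m)] over grid-supported
    distributions matching E[F(a)] = m1 and E[F(a)^2] = m2."""
    lo, hi = math.inf, -math.inf
    for pts in combinations(grid, 3):
        p = [F(a) for a in pts]
        A = [[1.0, 1.0, 1.0], p, [q * q for q in p]]
        w = solve3(A, [1.0, m1, m2])
        if w is None or min(w) < -1e-9:
            continue  # this triple is not a feasible vertex
        val = sum(wi * (F(beta + a) - F(a)) for wi, a in zip(w, pts))
        lo, hi = min(lo, val), max(hi, val)
    return lo, hi

# Hypothetical truth: beta = 1 and a three-point distribution on the grid.
beta = 1.0
grid = [-4.0 + 0.5 * i for i in range(17)]
true_support, true_weights = (-1.0, 0.0, 1.0), (0.3, 0.4, 0.3)
m1 = sum(w * F(a) for a, w in zip(true_support, true_weights))
m2 = sum(w * F(a) ** 2 for a, w in zip(true_support, true_weights))
true_effect = sum(w * (F(beta + a) - F(a))
                  for a, w in zip(true_support, true_weights))
lo, hi = effect_bounds(beta, grid, m1, m2)
print(lo, true_effect, hi)
```

Vertex enumeration is used only because it needs no solver; for realistic grids a linear programming routine would be the natural choice.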



6.2 Numerical Examples 

We carry out some numerical calculations to illustrate and complement the previous analytical
results. We use the following binary choice model

$$Y_{it} = 1\{X_{it}\beta^* + \alpha_i + \epsilon_{it} > 0\}, \tag{23}$$

with $\epsilon_{it}$ i.i.d. over $t$, normal or logistic with zero mean and unit variance. The explanatory
variable $X_{it}$ is binary and i.i.d. over $t$ with $p_X = \Pr\{X_{it} = 1\} = 0.5$. The unobserved individual
effect $\alpha_i$ is correlated with the explanatory variable for each individual. In particular, we generate
this effect as

$$\alpha_i = \alpha_{1i} + \alpha_{2i},$$

where $\alpha_{1i}$ is a random component independent of the regressors with

$$\Pr\{\alpha_{1i} = a_m\} = \begin{cases} \Phi\left(\dfrac{a_{m+1} + a_m}{2}\right), & \text{for } a_m = -3.0; \\[2ex] \Phi\left(\dfrac{a_{m+1} + a_m}{2}\right) - \Phi\left(\dfrac{a_m + a_{m-1}}{2}\right), & \text{for } a_m = -2.8, -2.6, \ldots, 2.8; \\[2ex] 1 - \Phi\left(\dfrac{a_m + a_{m-1}}{2}\right), & \text{for } a_m = 3.0; \end{cases}$$

as in Honoré and Tamer (2006), and $\alpha_{2i} = \sqrt{T}(\bar{X}_i - p_X)/\sqrt{p_X(1 - p_X)}$ with $\bar{X}_i = \sum_{t=1}^{T} X_{it}/T$.
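
The design for the individual effect can be sketched in a few lines. The code assumes the discretized standard normal on the grid $-3.0, -2.8, \ldots, 3.0$ as reconstructed above, with cell boundaries at the midpoints between neighboring support points; the function names and defaults are illustrative.

```python
import math
import random
from statistics import NormalDist

def alpha1_distribution(lo=-3.0, hi=3.0, step=0.2):
    """Support points a_m and probabilities of the discretized N(0, 1)."""
    Phi = NormalDist().cdf
    M = round((hi - lo) / step) + 1
    support = [round(lo + step * m, 10) for m in range(M)]
    probs = []
    for m, a in enumerate(support):
        # Cell boundaries are midpoints; the end cells absorb the tails.
        upper = 1.0 if m == M - 1 else Phi((support[m + 1] + a) / 2)
        lower = 0.0 if m == 0 else Phi((a + support[m - 1]) / 2)
        probs.append(upper - lower)
    return support, probs

def draw_alpha(x_path, support, probs, p_x=0.5, rng=random):
    """alpha_i = alpha_1i + alpha_2i for one individual's regressor path."""
    T = len(x_path)
    x_bar = sum(x_path) / T
    alpha1 = rng.choices(support, weights=probs, k=1)[0]
    alpha2 = math.sqrt(T) * (x_bar - p_x) / math.sqrt(p_x * (1 - p_x))
    return alpha1 + alpha2

support, probs = alpha1_distribution()
print(len(support), sum(probs))
print(draw_alpha((1, 0, 1), support, probs, rng=random.Random(0)))
```

The second component $\alpha_{2i}$ is the standardized time average of the regressor, which is what makes the individual effect depend on $X_i$.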

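Identified sets for parameters and marginal effects are calculated for panels with 2, 3, and 4
periods based on the conditional mean model of Section 2 and semiparametric logit and probit
models. For logit and probit models the sets are obtained using a linear programming algorithm
for discrete regressors, as in Honoré and Tamer (2006). Thus, for the parameter we have that
$B = \{\beta : L(\beta) = 0\}$, where

$$L(\beta) = \min_{w_k,\, v_{jk},\, \pi_{km}} \sum_{k=1}^{K} \sum_{j=1}^{J} v_{jk} + \sum_{k=1}^{K} w_k \tag{24}$$

subject to

$$v_{jk} + \sum_{m=1}^{M} \pi_{km}\, \mathcal{L}(Y^j \mid X^k, \alpha_m, \beta) = \mathcal{P}_{jk}\ \ \forall j, k,$$

$$w_k + \sum_{m=1}^{M} \pi_{km} = 1\ \ \forall k,$$

$$v_{jk} \ge 0,\ w_k \ge 0,\ \pi_{km} \ge 0\ \ \forall j, k, m.$$

For marginal effects, see also Chernozhukov, Hahn, and Newey (2004), we solve

$$\overline{\mu}_k / \underline{\mu}_k = \max / \min_{\pi_{km},\, \beta \in B} D^{-1} \sum_{m=1}^{M} \pi_{km}\, [F(\tilde{x}'\beta + \alpha_m) - F(\bar{x}'\beta + \alpha_m)] \tag{25}$$

subject to

$$\sum_{m=1}^{M} \pi_{km}\, \mathcal{L}(Y^j \mid X^k, \alpha_m, \beta) = \mathcal{P}_{jk}\ \ \forall j, \qquad \pi_{km} \ge 0\ \ \forall m.$$

The identified sets are compared to the probability limits of linear and nonlinear fixed effects
estimators.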

Figure 2 shows identified sets for the slope coefficient $\beta^*$ in the logit model. The figures
agree with the well-known result that the model parameter is point identified when $T \ge 2$, e.g.,
Andersen (1973). The fixed effect estimator is inconsistent and has a probability limit that is
biased away from zero. For example, for $T = 2$ it coincides with the value $2\beta^*$ obtained by
Andersen (1973). For $T > 2$, the proportionality $\hat{\beta} = c\beta^*$ for some constant $c$ breaks down.

Identified sets for marginal effects are plotted in Figures 3-7, together with the probability
limits of fixed effects maximum likelihood estimators (Figures 4-6) and linear probability model
estimators (Figure 7). Figure 3 shows identified sets based on the general conditional mean
model. The bounds of these sets are obtained using the general bounds (G-bound) for binary
regressors in (6), and imposing the monotonicity restriction on $\Delta(\alpha)$ in Lemma 3 (GM-bound).
In this example the monotonicity restriction has important identification content in reducing 
the size of the bounds. 

Figures 4-6 show that marginal effects are point identified for individuals with switches in 
the value of the regressor, and nonlinear fixed effects estimators are consistent for these effects. 
This numerical finding suggests that the consistency result for nonlinear fixed effects estimators 
extends to more than two periods. Unless $\beta^* = 0$, marginal effects for individuals without
switches in the regressor are not point identified, which also precludes point identification of 
the average effect. Nonlinear fixed effects estimators are biased toward zero for the unidentified 
effects, and have probability limits that usually lie outside of the identified set. However, both 
the size of the identified sets and the asymptotic biases of these estimators shrink very fast with 
the number of time periods. In Figure 7 we see that linear probability model estimators have 
probability limits that usually fall outside the identified set for the marginal effect. 

For the probit, Figure 8 shows that the model parameter is not point identified, but the size
of the identified set shrinks very fast with the number of time periods. The identified sets and 
limits of fixed effects estimators in Figures 9-13 are analogous to the results for logit. 

^We consider the version of the linear probability model that allows for individual specific slopes in addition
to the fixed effects.

7 Estimation



7.1 Minimum Distance Estimator 

In multinomial choice models with discrete regressors the complete description of the DGP is
provided by the parameter vector

$$(\Pi', \Pi^{X\prime})', \quad \Pi = (\Pi_{jk},\ j = 1, \ldots, J,\ k = 1, \ldots, K), \quad \Pi^X = (\Pi_k,\ k = 1, \ldots, K),$$

where

$$\Pi_{jk} = \Pr(Y = Y^j \mid X = X^k), \quad \Pi_k = \Pr(X = X^k).$$

We denote the true value of this parameter vector by $(\mathcal{P}', \mathcal{P}^{X\prime})'$, and the nonparametric empirical
estimates by $(P', P^{X\prime})'$. As is common in regression analysis, we condition on the observed
distribution of $X$ by setting the true value of the probabilities of $X$ to the empirical ones, that
is,

$$\Pi^X = P^X, \quad \mathcal{P}^X = P^X.$$

Having fixed the distribution of $X$, the DGP is completely described by the conditional choice
probabilities $\Pi$.

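Our minimum distance estimator is the solution to the following quadratic problem:

$$B_n = \left\{ \beta \in \mathbb{B} : T(\beta; P) \le \min_{\beta \in \mathbb{B}} T(\beta; P) + \epsilon_n \right\},$$

where $\mathbb{B}$ is the parameter space, $\epsilon_n$ is a positive cut-off parameter that shrinks to zero with the
sample size, as in Chernozhukov, Hong, and Tamer (2007), and

$$T(\beta; P) = \min_{Q \in \mathcal{Q}} \sum_{j,k} \omega_{jk}(P) \left[ P_{jk} - \int \mathcal{L}(Y^j \mid X^k, \alpha, \beta)\, Q_k(d\alpha) \right]^2,$$

where $\mathcal{Q}$ is the set of conditional distributions for $\alpha$ with $J$ points of support for each covariate
value index $k$, that is, for $\mathbb{S}$ the unit simplex in $\mathbb{R}^J$ and $\delta_{\alpha_{km}}$ the Dirac delta function at $\alpha_{km}$,

$$\mathcal{Q} = \left\{ Q := (Q_1, \ldots, Q_K) : Q_k(d\alpha) = \sum_{m=1}^{J} \pi_{km}\, \delta_{\alpha_{km}}(\alpha)\, d\alpha,\ (\alpha_{k1}, \ldots, \alpha_{kJ}) \in \mathcal{C},\ (\pi_{k1}, \ldots, \pi_{kJ}) \in \mathbb{S},\ \forall k \right\}.$$

Here we make use of Lemma 10, which tells us that we can obtain a minimizing solution for $Q_k$ as
a discrete distribution with at most $J$ points of support for each $k$. Alternatively, we can write
more explicitly

$$T(\beta; P) = \min_{\substack{\alpha_k = (\alpha_{k1}, \ldots, \alpha_{kJ}) \in \mathcal{C}, \\ \pi_k = (\pi_{k1}, \ldots, \pi_{kJ}) \in \mathbb{S},\ \forall k}} \sum_{j,k} \omega_{jk}(P) \left[ P_{jk} - \sum_{m=1}^{J} \pi_{km}\, \mathcal{L}(Y^j \mid X^k, \alpha_{km}, \beta) \right]^2. \tag{26}$$

In the appendix we give a computational algorithm to solve this problem.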

For estimation and inference it is important to allow for the possibility that the postulated
model is not perfectly specified, but still provides a good approximation to the true DGP. In
this case, when the conditional choice probabilities are misspecified, $B_n$ estimates the identified
set for the parameter of the best approximating model to the true DGP with respect to a
chi-square distance. This model is obtained by projecting the true DGP $\mathcal{P}$ onto $\Xi$, the space of
conditional choice probabilities that are compatible with the model. In particular, the projection
$\mathcal{P}^*$ corresponds to the solution of the minimum distance problem:

$$\mathcal{P}^* = \Pi^*(\mathcal{P}) \in \arg\min_{\Pi \in \Xi} W(\Pi, \mathcal{P}), \quad W(\Pi, \mathcal{P}) = \sum_{j,k} \omega_{jk}(\mathcal{P})\, (\mathcal{P}_{jk} - \Pi_{jk})^2, \tag{27}$$

where

$$\Xi = \left\{ \Pi : \Pi_{jk} = \sum_{m=1}^{J} \pi_{km}\, \mathcal{L}(Y^j \mid X^k, \alpha_{km}, \beta),\ (\alpha_{k1}, \ldots, \alpha_{kJ}) \in \mathcal{C},\ (\pi_{k1}, \ldots, \pi_{kJ}) \in \mathbb{S},\ \beta \in \mathbb{B},\ \forall (j, k) \right\}.$$

To simplify the exposition, we will assume throughout that $\mathcal{P}^*$ is unique. Of course, when $\mathcal{P} \in \Xi$,
then $\mathcal{P}^* = \mathcal{P}$ and the assumption holds trivially. The identified set for the parameter of the
best approximating model is

$$B^* = \left\{ \beta \in \mathbb{B} : \exists Q \in \mathcal{Q} \text{ s.t. } \int \mathcal{L}(Y^j \mid X^k, \alpha, \beta)\, dQ_k(\alpha) = \mathcal{P}^*_{jk},\ \forall (j, k) \right\},$$

i.e., the values of the parameter $\beta$ that are compatible with the projected DGP $\mathcal{P}^* = (\mathcal{P}^*_{jk},\ j =
1, \ldots, J,\ k = 1, \ldots, K)$. Under correct specification of the semiparametric model, we have that
$\mathcal{P}^* = \mathcal{P}$ and $B^* = B$.

We shall use the following assumptions. 

Assumption 1: (i) The function $F$ defined in (12) is continuous in $(\alpha, \beta)$, so that the
conditional choice probabilities $\mathcal{L}_{jk}(\alpha, \beta) = \mathcal{L}(Y^j \mid X^k, \alpha, \beta)$ are also continuous for all $(j, k)$;
(ii) $B^* \subseteq \mathbb{B}$ for some compact set $\mathbb{B}$; (iii) $\alpha_i$ has a support contained in a compact set $\mathcal{C}$; and
(iv) the weights $\omega_{jk}(P)$ are continuous in $P$ at $\mathcal{P}$, and $0 < \omega_{jk}(\mathcal{P}) < \infty$ for all $(j, k)$.

Assumption 1(i) holds for commonly used semiparametric models such as logit and probit
models. The condition 1(iv) about the weights is satisfied by the chi-square weights $\omega_{jk}(\mathcal{P}) =
\mathcal{P}_k / \mathcal{P}_{jk}$ if $\mathcal{P}_{jk} > 0$, $\forall (j, k)$.

^Otherwise, the assumption can be justified using a genericity argument similar to that presented in Newey
(1986), see Appendix. For non-generic values, we can simply select one element of the projection using an
additional complete ordering criterion, and work with the resulting approximating model. In practice, we never
encountered a non-generic value.

In some results, we also employ the following assumption.

Assumption 2: Every $\beta^* \in B^*$ is regular at $\mathcal{P}$ in the sense that, for any sequence $\Pi_n \to \mathcal{P}$,
there exists a sequence $\beta_n \in \arg\min_{\beta \in \mathbb{B}} T(\beta; \Pi_n)$ such that $\beta_n \to \beta^*$.

In a variety of cases the assumption of regularity appears to be a good one. First of all, the
assumption holds under point identification, as in the logit model, by the standard consistency
argument for maximum likelihood/minimum distance estimators. Second, for probit and other
similar models, we can argue that this assumption can also be expected to hold when the true
distribution of the individual effect $\alpha_i$ is absolutely continuous, with the exception perhaps of
very non-regular parameter spaces and non-generic situations.

To explain the last point, it is convenient to consider a correctly specified model for
simplicity. Let the vector of model conditional choice probabilities for $(Y^1, \ldots, Y^J)$ be $\mathcal{L}_k(\alpha, \beta) :=
(\mathcal{L}_{1k}(\alpha, \beta), \ldots, \mathcal{L}_{Jk}(\alpha, \beta))'$. Let $\Gamma_k(\beta) := \{\mathcal{L}_k(\alpha, \beta) : \alpha \in \mathcal{C}\}$ and let $\mathcal{M}_k(\beta)$ be the convex hull
of $\Gamma_k(\beta)$. In the case of probit the specification is non-trivial in the sense that $\mathcal{M}_k(\beta)$ possesses
a non-empty interior with respect to the $J$ dimensional simplex. For every $\beta^* \in B$ and some $Q_k^*$,
we have that $\mathcal{L}_{jk}(\beta^*, Q_k^*) = \mathcal{P}_{jk}$ for all $(j, k)$, that is, $(\mathcal{P}_{1k}, \ldots, \mathcal{P}_{Jk}) \in \mathcal{M}_k(\beta^*)$ for all $k$.
Moreover, under absolute continuity of the true $Q^*$ we must have $(\mathcal{P}_{1k}, \ldots, \mathcal{P}_{Jk}) \in \text{interior}\ \mathcal{M}_k(\beta_0)$
for all $k$, where $\beta_0 \in B$ is the true value of $\beta$. Next, for any $\beta^*$ in the neighborhood of $\beta_0$, we
must have $(\mathcal{P}_{1k}, \ldots, \mathcal{P}_{Jk}) \in \text{interior}\ \mathcal{M}_k(\beta^*)$ for all $k$, and so on. In order for a point $\beta^*$ to be
located on the boundary of $B$ we must have that $(\mathcal{P}_{1k}, \ldots, \mathcal{P}_{Jk}) \in \partial\mathcal{M}_k(\beta^*)$ for some $k$. Thus, if
the identified set has a dense interior, which we verified numerically in a variety of examples for
the probit model, then each point in the identified set must be regular. Indeed, take first a point
$\beta^*$ in the interior of $B$. Then, for any sequence $\Pi_n \to \mathcal{P}$, we must have $(\Pi_{n,1k}, \ldots, \Pi_{n,Jk}) \in \mathcal{M}_k(\beta^*)$
for all $k$ for large $n$, so that $T(\beta^*; \Pi_n) = 0$ for large $n$. Thus, there is a sequence of points in
$\arg\min_{\beta \in \mathbb{B}} T(\beta; \Pi_n)$ converging to $\beta^*$. Now take a point $\beta^{**}$ on the boundary of $B$; then for each
$\epsilon > 0$, there is $\beta^*$ in the interior such that $\|\beta^* - \beta^{**}\| < \epsilon/2$ and such that there is a sequence of points
$\beta_n$ in $\arg\min_{\beta \in \mathbb{B}} T(\beta; \Pi_n)$ and a finite number $n(\epsilon)$ such that for all $n > n(\epsilon)$, $\|\beta^* - \beta_n\| < \epsilon/2$.
Thus, for all $n > n(\epsilon)$, $\|\beta^{**} - \beta_n\| < \epsilon$. Since $\epsilon > 0$ is arbitrary, it follows that $\beta^{**}$ is regular.

We can now give a consistency result for the quadratic estimator.

Theorem 12: If Assumption 1 holds and $\epsilon_n \propto \log n / n$ then

$$d_H(B_n, B^*) = o_P(1),$$

where $d_H$ is the Hausdorff distance between sets,

$$d_H(B_n, B^*) = \max\left[ \sup_{\beta_n \in B_n} \inf_{\beta \in B^*} |\beta_n - \beta|,\ \sup_{\beta \in B^*} \inf_{\beta_n \in B_n} |\beta_n - \beta| \right].$$

Under Assumption 2 the result holds for $\epsilon_n = 0$.

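Moreover, under Assumption 1 the model-predicted probabilities are consistent: for any $\beta_n \in
B_n$, and each $j$ and $k$,

$$P^*_{jk} = \sum_{m=1}^{J} \pi_{km}(\beta_n)\, \mathcal{L}(Y^j \mid X^k, \alpha_{km}(\beta_n), \beta_n) \to_P \mathcal{P}^*_{jk}, \tag{28}$$

where $\{\pi_{km}(\beta_n), \alpha_{km}(\beta_n),\ \forall k, m\}$ is a solution to the minimum distance problem (26) for any
$\epsilon_n \to 0$, where we assume that $\mathcal{P}^*$ is unique.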
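
The set estimator $B_n$ and the Hausdorff distance in Theorem 12 can be illustrated on a grid. The sketch below is an illustration only: the objective is a hypothetical stand-in for $T(\beta; P)$ that is flat at zero on $[1, 2]$, and the grid and cut-off values are arbitrary choices.

```python
def level_set(grid, objective, eps):
    """B_n on a grid: points whose objective is within eps of the minimum."""
    values = [objective(b) for b in grid]
    cutoff = min(values) + eps
    return [b for b, v in zip(grid, values) if v <= cutoff]

def hausdorff(A, B):
    """Hausdorff distance between two finite, non-empty sets of scalars."""
    d_ab = max(min(abs(a - b) for b in B) for a in A)
    d_ba = max(min(abs(a - b) for a in A) for b in B)
    return max(d_ab, d_ba)

# Hypothetical objective, equal to zero exactly on the interval [1, 2].
def objective(b):
    return max(0.0, 1.0 - b, b - 2.0) ** 2

grid = [i / 10 for i in range(0, 31)]
B_star = level_set(grid, objective, eps=0.0)    # exact minimizers on the grid
B_n = level_set(grid, objective, eps=0.02)      # slack set with cut-off eps
print(B_star[0], B_star[-1], B_n[0], B_n[-1], hausdorff(B_n, B_star))
```

A positive cut-off enlarges the estimated set slightly, and the Hausdorff distance measures by how much; as the cut-off shrinks the two sets coincide on the grid.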

7.2 Marginal Effects 

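We next consider the problem of estimation of marginal effects, which is of prime interest. An
immediate issue that arises is that we cannot directly use the solution to the minimum distance
problem to estimate the marginal effects. Indeed, the constraints of the linear
programs for these effects in (25) may not hold for any $\beta \in B_n$ when $\mathcal{P}$ is replaced by $P$
due to sampling variation or under misspecification. In order to resolve the infeasibility issue,
we replace the nonparametric estimates $P_{jk}$ by the probabilities predicted by the model $P^*_{jk}$
as defined in (28), and we re-target our estimands to the marginal effects defined in the best
approximating model.

To describe the estimator of the bounds for the marginal effects, it is convenient to introduce
some notation. Let

$$\underline{\mu}^*_k(\beta, \Pi) = \min_{\alpha_k,\, \pi_k} D^{-1} \sum_{m=1}^{J} [F(\tilde{x}'\beta + \alpha_{km}) - F(\bar{x}'\beta + \alpha_{km})]\, \pi_{km} \tag{29}$$

$$\text{s.t.}\quad \Pi^*_{jk} = \sum_{m=1}^{J} \mathcal{L}(Y^j \mid X^k, \alpha_{km}, \beta)\, \pi_{km}\ \ \forall j, \quad \alpha_k = (\alpha_{k1}, \ldots, \alpha_{kJ}) \in \mathcal{C}, \quad \pi_k = (\pi_{k1}, \ldots, \pi_{kJ}) \in \mathbb{S},$$

and

$$\overline{\mu}^*_k(\beta, \Pi) = \max_{\alpha_k,\, \pi_k} D^{-1} \sum_{m=1}^{J} [F(\tilde{x}'\beta + \alpha_{km}) - F(\bar{x}'\beta + \alpha_{km})]\, \pi_{km} \tag{30}$$

$$\text{s.t.}\quad \Pi^*_{jk} = \sum_{m=1}^{J} \mathcal{L}(Y^j \mid X^k, \alpha_{km}, \beta)\, \pi_{km}\ \ \forall j, \quad \alpha_k = (\alpha_{k1}, \ldots, \alpha_{kJ}) \in \mathcal{C}, \quad \pi_k = (\pi_{k1}, \ldots, \pi_{kJ}) \in \mathbb{S},$$

where $\Pi^* = (\Pi^*_{jk},\ j = 1, \ldots, J,\ k = 1, \ldots, K)$ denotes the projection of $\Pi$ onto $\Xi$, i.e., $\Pi^* =
\Pi^*(\Pi)$ as defined in (27). Thus, the lower and upper bounds on the true marginal effects of the
best approximating model take the form:

$$\underline{\mu}^*_k = \min_{\beta \in B^*} \underline{\mu}^*_k(\beta, \mathcal{P}), \quad \overline{\mu}^*_k = \max_{\beta \in B^*} \overline{\mu}^*_k(\beta, \mathcal{P}).$$

Under correct specification, these correspond to the lower and upper bounds on the marginal
effects in (14) and (15). We estimate the bounds by

$$\hat{\underline{\mu}}^*_k = \min_{\beta \in B_n} \underline{\mu}^*_k(\beta, P), \quad \hat{\overline{\mu}}^*_k = \max_{\beta \in B_n} \overline{\mu}^*_k(\beta, P).$$

Theorem 13: If Assumption 1 is satisfied and $\epsilon_n \propto \log n / n$ then

$$\hat{\underline{\mu}}^*_k = \underline{\mu}^*_k + o_P(1), \quad \hat{\overline{\mu}}^*_k = \overline{\mu}^*_k + o_P(1).$$

Under Assumption 2 the result holds for $\epsilon_n = 0$.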



8 Inference 

8.1 Modified Projection Method 

The following method projects a confidence region for conditional choice probabilities onto a
simultaneous confidence region for all possible marginal effects and other structural parameters.
If a single marginal effect is of interest, then this approach is conservative; if all (or many)
marginal effects are of interest, then this approach is sharp (or close to sharp). In the next
section, we will present an approach that appears to be sharp, at least in large samples, when a
particular single marginal effect is of interest.

It is convenient to describe the approach in two stages.

Stage 1. The nonparametric space $\Xi_N$ of conditional choice probabilities is the product of
$K$ simplex sets $\mathbb{S}$ of dimension $J$, that is, $\Xi_N = \mathbb{S}^K$. Thus we can begin by constructing a
confidence region for the true choice probabilities $\mathcal{P}$ by collecting all probabilities $\Pi \in \Xi_N$ that
pass a goodness-of-fit test:

$$CR_{1-\alpha}(\mathcal{P}) = \left\{ \Pi \in \Xi_N : W(\Pi, P) \le c_{1-\alpha}(\chi^2_{K(J-1)}) \right\},$$

where $c_{1-\alpha}(\chi^2_{K(J-1)})$ is the $(1 - \alpha)$-quantile of the $\chi^2_{K(J-1)}$ distribution and $W$ is the
goodness-of-fit statistic:

$$W(\Pi, P) = n \sum_{j,k} P_k\, \frac{(P_{jk} - \Pi_{jk})^2}{\Pi_{jk}}.$$
Stage 2. To construct confidence regions for marginal effects and any other structural
parameters we project each $\Pi \in CR_{1-\alpha}(\mathcal{P})$ onto $\Xi$, the space of conditional choice probabilities
that are compatible with the model. We obtain this projection $\Pi^*(\Pi)$ by solving the minimum
distance problem:

$$\Pi^*(\Pi) = \arg\min_{\tilde{\Pi} \in \Xi} W(\tilde{\Pi}, \Pi), \quad W(\tilde{\Pi}, \Pi) = n \sum_{j,k} P_k\, \frac{(\Pi_{jk} - \tilde{\Pi}_{jk})^2}{\tilde{\Pi}_{jk}}. \tag{31}$$

The confidence regions are then constructed from the projections of all the choice probabilities in
$CR_{1-\alpha}(\mathcal{P})$. For the identified set of the model parameter, for example, for each $\Pi \in CR_{1-\alpha}(\mathcal{P})$
we solve

$$B^*(\Pi) = \left\{ \beta \in \mathbb{B} : \exists Q \in \mathcal{Q} \text{ s.t. } \int \mathcal{L}(Y^j \mid X^k, \alpha, \beta)\, dQ_k(\alpha) = \Pi^*_{jk},\ \forall (j, k),\ \Pi^* = \Pi^*(\Pi) \right\}. \tag{32}$$

Denote the resulting confidence region as

$$CR_{1-\alpha}(B^*) = \{ B^*(\Pi) : \Pi \in CR_{1-\alpha}(\mathcal{P}) \}.$$

We may interpret this set as a confidence region for the set $B^*$ collecting all values $\beta^*$ that are
compatible with the best approximating model $\mathcal{P}^*$. Under correct specification, $B^*$ is just the
identified set $B$.

If we are interested in bounds on marginal effects, for each $\Pi \in CR_{1-\alpha}(\mathcal{P})$ we get

$$\underline{\mu}^*_k(\Pi) = \min_{\beta \in B^*(\Pi)} \underline{\mu}^*_k(\beta, \Pi), \quad \overline{\mu}^*_k(\Pi) = \max_{\beta \in B^*(\Pi)} \overline{\mu}^*_k(\beta, \Pi), \quad k = 1, \ldots, K.$$

Denote the resulting confidence regions as

$$CR_{1-\alpha}[\underline{\mu}^*_k, \overline{\mu}^*_k] = \{ [\underline{\mu}^*_k(\Pi), \overline{\mu}^*_k(\Pi)] : \Pi \in CR_{1-\alpha}(\mathcal{P}) \}.$$

These sets are confidence regions for the sets $[\underline{\mu}^*_k, \overline{\mu}^*_k]$, where $\underline{\mu}^*_k$ and $\overline{\mu}^*_k$ are the lower and upper
bounds on the marginal effects induced by any best approximating model in $(B^*, \mathcal{P}^*)$. Under
correct specification, these will include the lower and upper bounds on the marginal effect $[\underline{\mu}_k, \overline{\mu}_k]$
induced by any true model in $(B, \mathcal{P})$.

In a canonical projection method we would implement the second stage by simply intersecting
$CR_{1-\alpha}(\mathcal{P})$ with $\Xi$, but this may give an empty intersection either in finite samples or under
misspecification. We avoid this problem by using the projection step instead of the intersection,
and also by re-targeting our confidence regions onto the best approximating model. In order to
state the result about the validity of our modified projection method in large samples, let $\Lambda$ be
the set of vectors with all components bounded away from zero by some $\epsilon > 0$.

Theorem 14: Suppose Assumption 1 holds, then for (any sequence of true parameter values)
$\mathcal{P}_0 \in \Lambda$,

$$\lim_{n \to \infty} \Pr\nolimits_{\mathcal{P}_0} \left\{ \begin{array}{l} \mathcal{P} \in CR_{1-\alpha}(\mathcal{P}) \\ B^* \in CR_{1-\alpha}(B^*) \end{array} \right\} = 1 - \alpha.$$

8.2 Perturbed Bootstrap

In this section we present an approach that appears to be sharper than the projection method, 
at least in large samples, when a particular single marginal effect is of interest. Our estima- 
tors for parameters and marginal effects are obtained by nonlinear programming subject to 
data-dependent constraints that are modified to respect the constraints of the model. The dis- 
tributions of these highly complex estimators are not tractable, and are also non-regular in the 
sense that the limit versions of these distributions do not vary with perturbations of the DGP 
in a continuous fashion. This implies that the usual bootstrap is not consistent. To overcome 
all of these difficulties we will rely on a variation of the bootstrap, which we call the perturbed 
bootstrap. 

The usual bootstrap computes the critical value, that is, the $\alpha$-quantile of the distribution of a
test statistic, given a consistently estimated data generating process (DGP). If this critical
value is not a continuous function of the DGP, the usual bootstrap fails to consistently estimate 
the critical value. We instead consider the perturbed bootstrap, where we compute a set of 
critical values generated by suitable perturbations of the estimated DGP and then take the 
most conservative critical value in the set. If the perturbations cover at least one DGP that 
gives a more conservative critical value than the true DGP does, then this approach yields a 
valid inference procedure. 

The approach outlined above is most closely related to the Monte-Carlo inference approach 
of Dufour (2006); see also Romano and Wolf (2000) for a finite-sample inference procedure for 
the mean that has a similar spirit. In the set-identified context, this approach was first applied 
in the MIT thesis work of Rytchkov (2007); see also Chernozhukov (2007). 

Recall that the complete description of the DGP is provided by the parameter vector
$(\Pi', \Pi^{X\prime})'$, where $\Pi = (\Pi_{jk},\ j = 1, \ldots, J,\ k = 1, \ldots, K)'$, $\Pi^X = (\Pi_k,\ k = 1, \ldots, K)'$, $\Pi_{jk} = \Pr(Y =
Y^j \mid X = X^k)$, and $\Pi_k = \Pr(X = X^k)$. The true value of the parameter vector is $(\mathcal{P}', \mathcal{P}^{X\prime})'$ and
the nonparametric empirical estimate is $(P', P^{X\prime})'$. As before, we condition on the observed
distribution of $X$ and thus set $\Pi^X = P^X$ and $\mathcal{P}^X = P^X$.

We consider the problem of performing inference on a real parameter $\theta^*$. For example, $\theta^*$
can be an upper (or lower) bound on the marginal effect such as

$$\theta^*(\Pi) = \max_{\beta \in B^*(\Pi),\, Q_k} D^{-1} \int [F(\tilde{x}'\beta + \alpha) - F(\bar{x}'\beta + \alpha)]\, Q_k(d\alpha) \quad \text{s.t.} \quad \mathcal{L}_{jk}(\beta, Q_k) = \Pi^*_{jk},\ \forall j,$$

where $\Pi^* = (\Pi^*_{jk},\ j = 1, \ldots, J,\ k = 1, \ldots, K)$ denotes the projection of $\Pi$ onto the model space, as
defined in (31), and $B^*(\Pi)$ is the corresponding projection for the identified set of the parameter
defined as in (32). Alternatively, $\theta^*$ can be an upper (or lower) bound on a scalar functional
$c'\beta^*$ of the parameter $\beta^*$. Then we define

$$\theta^*(\Pi) = \max_{\beta \in B^*(\Pi)} c'\beta.$$

As before, we project $\Pi$ onto the model space in order to address the problem of infeasibility of
constraints defining the parameters of interest under misspecification or sampling error. Under
misspecification, we interpret our inference as targeting the parameters of interest in the best
approximating model.

In order to perform inference on the true value $\theta^* = \theta^*(\mathcal{P})$ of the parameter, we use the
statistic

$$S_n = \hat{\theta} - \theta^*,$$

where $\hat{\theta} = \theta^*(P)$. Let $G_n(s, \Pi)$ denote the distribution function of $S_n(\Pi) = \hat{\theta} - \theta^*(\Pi)$, when the
data follow the DGP $\Pi$. The goal is to estimate the distribution of the statistic $S_n$ under the
true DGP $\Pi = \mathcal{P}$, that is, to estimate $G_n(s, \mathcal{P})$.

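The method proceeds by constructing a confidence region $CR_{1-\gamma}(\mathcal{P})$ that contains the true
DGP $\mathcal{P}$ with probability $1 - \gamma$, close to one. For efficiency purposes, we also want the confidence
region to be an efficient estimator of $\mathcal{P}$, in the sense that as $n \to \infty$, $d_H(CR_{1-\gamma}(\mathcal{P}), \mathcal{P}) =
O_P(n^{-1/2})$, where $d_H$ is the Hausdorff distance between sets. Specifically, in our case we use

$$CR_{1-\gamma}(\mathcal{P}) = \left\{ \Pi \in \Xi_N : W(\Pi, P) \le c_{1-\gamma}(\chi^2_{K(J-1)}) \right\},$$

where $c_{1-\gamma}(\chi^2_{K(J-1)})$ is the $(1 - \gamma)$-quantile of the $\chi^2_{K(J-1)}$ distribution and $W$ is the
goodness-of-fit statistic:

$$W(\Pi, P) = n \sum_{j,k} P_k\, \frac{(P_{jk} - \Pi_{jk})^2}{\Pi_{jk}}.$$

Then we define the estimates of lower and upper bounds on the quantiles of $G_n(s, \mathcal{P})$ as

$$\underline{G}_n^{-1}(\alpha, \mathcal{P}) / \overline{G}_n^{-1}(\alpha, \mathcal{P}) = \inf / \sup_{\Pi \in CR_{1-\gamma}(\mathcal{P})} G_n^{-1}(\alpha, \Pi), \tag{33}$$

where $G_n^{-1}(\alpha, \Pi) = \inf\{s : G_n(s, \Pi) \ge \alpha\}$ is the $\alpha$-quantile of the distribution function $G_n(s, \Pi)$.
Then we construct a $(1 - \alpha - \gamma) \cdot 100\%$ confidence region for the parameter of interest as

$$CR_{1-\alpha-\gamma}(\theta^*) = [\underline{\theta}, \overline{\theta}],$$

where, for $\alpha = \alpha_1 + \alpha_2$,

$$\underline{\theta} = \hat{\theta} - \overline{G}_n^{-1}(1 - \alpha_1, \mathcal{P}), \quad \overline{\theta} = \hat{\theta} - \underline{G}_n^{-1}(\alpha_2, \mathcal{P}).$$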

This formulation allows for both one-sided intervals (either $\alpha_1 = 0$ or $\alpha_2 = 0$) or two-sided
intervals ($\alpha_1 = \alpha_2 = \alpha/2$).

The following theorem shows that this method delivers (uniformly) valid inference on the
parameter of interest.

Theorem 15. Suppose Assumption 1 holds, then for (any sequence of true parameter values)
$\mathcal{P}_0 \in \Lambda$,

$$\lim_{n \to \infty} \Pr\nolimits_{\mathcal{P}_0} \left\{ \theta^* \in [\underline{\theta}, \overline{\theta}] \right\} \ge 1 - \alpha - \gamma.$$

In practice, we use the following computational approximation to the procedure described
above:

1. Draw a potential DGP $\Pi_r = (\Pi_{r1}', \ldots, \Pi_{rK}')'$, where $\Pi_{rk} \sim \mathcal{M}(nP_k, (P_{1k}, \ldots, P_{Jk}))/(nP_k)$
and $\mathcal{M}$ denotes the multinomial distribution.

2. Keep $\Pi_r$ if it passes the chi-square goodness of fit test with respect to $P$ at the $\gamma$ level,
using $K(J - 1)$ degrees of freedom, and proceed to the next step. Otherwise reject, and
repeat step 1.

3. Estimate the distribution $G_n(s, \Pi_r)$ of $S_n(\Pi_r)$ by simulation under the DGP $\Pi_r$.

4. Repeat steps 1 to 3 for $r = 1, \ldots, R$, obtaining $G_n(s, \Pi_r)$, $r = 1, \ldots, R$.

5. Let $\tilde{\underline{G}}_n^{-1}(\alpha, \mathcal{P}) / \tilde{\overline{G}}_n^{-1}(\alpha, \mathcal{P}) = \min / \max\{G_n^{-1}(\alpha, \Pi_1), \ldots, G_n^{-1}(\alpha, \Pi_R)\}$, and construct a $1 -
\alpha - \gamma$ confidence region for the parameter of interest as $CR_{1-\alpha-\gamma}(\theta^*) = [\underline{\theta}, \overline{\theta}]$, where
$\underline{\theta} = \hat{\theta} - \tilde{\overline{G}}_n^{-1}(1 - \alpha_1, \mathcal{P})$, $\overline{\theta} = \hat{\theta} - \tilde{\underline{G}}_n^{-1}(\alpha_2, \mathcal{P})$, and $\alpha_1 + \alpha_2 = \alpha$.

The computational approximation algorithm is necessarily successful if it generates at least
one draw of DGP $\Pi_r$ that gives more conservative estimates of the tail quantiles than the true
DGP does, namely $[G_n^{-1}(\alpha_2, \mathcal{P}), G_n^{-1}(1 - \alpha_1, \mathcal{P})] \subseteq [G_n^{-1}(\alpha_2, \Pi_r), G_n^{-1}(1 - \alpha_1, \Pi_r)]$.
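
Steps 1 to 5 can be sketched in code. The version below is a minimal illustration, not the authors' implementation: the functional `theta` is a hypothetical placeholder (a difference of two choice probabilities) standing in for $\theta^*(\Pi)$, which in the paper requires a projection and a nonlinear program, and the chi-square critical value is supplied by the user rather than computed.

```python
import math
import random

def draw_probs(rng, probs, counts):
    """Draw multinomial counts cell by cell and return them as proportions."""
    out = []
    for p_k, n_k in zip(probs, counts):
        draws = rng.choices(range(len(p_k)), weights=p_k, k=n_k)
        out.append([draws.count(j) / n_k for j in range(len(p_k))])
    return out

def gof(pi, p_hat, px, n):
    """W(Pi, P) = n * sum_{j,k} P_k (P_jk - Pi_jk)^2 / Pi_jk."""
    w = 0.0
    for pi_k, p_k, w_k in zip(pi, p_hat, px):
        for a, b in zip(pi_k, p_k):
            if a == 0.0:
                if b > 0.0:
                    return math.inf
                continue
            w += n * w_k * (b - a) ** 2 / a
    return w

def quantile(sorted_vals, a):
    """Empirical a-quantile: smallest s whose empirical CDF is at least a."""
    idx = min(len(sorted_vals) - 1, max(0, math.ceil(a * len(sorted_vals)) - 1))
    return sorted_vals[idx]

def perturbed_bootstrap_ci(p_hat, px, n, theta, crit, a1, a2, R=10, B=200, seed=0):
    rng = random.Random(seed)
    counts = [max(1, round(n * w)) for w in px]
    theta_hat = theta(p_hat)
    q_hi, q_lo = -math.inf, math.inf
    kept = 0
    while kept < R:                                     # step 4: R kept draws
        pi_r = draw_probs(rng, p_hat, counts)           # step 1
        if gof(pi_r, p_hat, px, n) > crit:              # step 2
            continue
        kept += 1
        stats = sorted(theta(draw_probs(rng, pi_r, counts)) - theta(pi_r)
                       for _ in range(B))               # step 3
        q_hi = max(q_hi, quantile(stats, 1 - a1))       # step 5: most
        q_lo = min(q_lo, quantile(stats, a2))           # conservative quantiles
    return theta_hat - q_hi, theta_hat - q_lo

# Hypothetical inputs: K = 2 cells, J = 2 outcomes, and a toy functional.
p_hat = [[0.6, 0.4], [0.3, 0.7]]
px = [0.5, 0.5]
theta = lambda pi: pi[0][0] - pi[1][0]
# 9.21 is the 0.99 quantile of the chi-square distribution with K(J-1) = 2 df.
lo, hi = perturbed_bootstrap_ci(p_hat, px, n=200, theta=theta,
                                crit=9.21, a1=0.02, a2=0.02)
print(lo, theta(p_hat), hi)
```

Taking the maximum of the upper quantiles and the minimum of the lower quantiles over the retained draws is what makes the interval conservative relative to a bootstrap based on a single estimated DGP.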

9 Empirical Example 

We now turn to an empirical application of our methods to a binary choice panel model of female
labor force participation. It is based on a sample of married women in the National Longitudinal
Survey of Youth 1979 (NLSY79). We focus on the relationship between participation and the
presence of young children in the years 1990, 1992, and 1994. The NLSY79 data set is convenient
for applying our methods because it provides a relatively homogeneous sample of women between 25
and 33 years old in 1990, which reduces the extent of other potential confounding factors that may
affect the participation decision, such as the age profile, and that are more difficult to incorporate
in our methods. Other studies that estimate similar models of participation in panel data include
Heckman and MaCurdy (1980), Heckman and MaCurdy (1982), Chamberlain (1984), Hyslop
(1999), Chay and Hyslop (2000), Carrasco (2001), Carro (2007), and Fernandez-Val (2008).

The sample consists of 1,587 married women. Only women continuously married, not
students or in the active forces, and with complete information on the relevant variables in the entire
sample period are selected from the survey. Descriptive statistics for the sample are shown in
Table 2. The labor force participation variable (LFP) is an indicator that takes the value one if
the woman's employment status is "in the labor force" according to the CPS definition, and zero
otherwise. The fertility variable (kids) indicates whether the woman has any child less than 3
years old. We focus on very young preschool children as most empirical studies find that their
presence has the strongest impact on the mother's participation decision. LFP is stable across
the years considered, whereas kids is increasing. The proportion of women that change fertility
status grows steadily with the number of time periods of the panel, but there are still 49% of
the women in the sample for which the effect of fertility is not identified after 3 periods.

The empirical specification we use is similar to Chamberlain (1984). In particular, we estimate
the following equation

$$LFP_{it} = 1\{\beta \cdot kids_{it} + \alpha_i + \epsilon_{it} > 0\}, \tag{34}$$

where $\alpha_i$ is an individual specific effect. The parameters of interest are the marginal effects
of fertility on participation for different groups of individuals, including the entire population.
These effects are estimated using the general conditional mean model and the semiparametric logit
and probit models described in Sections 2 and 5, together with linear and nonlinear fixed effects
estimators. Analytical and jackknife large-$T$ bias corrections are also considered, and
conditional fixed effects estimates are reported for the logit model. The estimates from the
general model impose monotonicity of the effects. For the semiparametric estimators, we use
the algorithm described in the appendix with penalty $\lambda_n = 1/(n \log n)$ and iterate the quadratic
program 3 times with initial weights $\omega_{jk} = nP_k$. This iteration makes the estimates insensitive
to the penalty and weighting. We search over discrete distributions with 23 support
points at $\{-\infty, -4, -3.6, \ldots, 3.6, 4, \infty\}$ in the quadratic problem, and with 163 support points
at $\{-\infty, -8, -7.9, \ldots, 7.9, 8, \infty\}$ in the linear programming problems. The estimates are based
on panels of 2 and 3 time periods, both of them starting in 1990.
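The support grids just described can be sketched in a few lines. This is only an illustration: it assumes the finite points are evenly spaced (steps of 0.4 and 0.1, which is what the first listed points and the stated counts of 23 and 163 suggest), and the helper name `support_grid` is hypothetical.

```python
import math

def support_grid(bound, step):
    """Evenly spaced points on [-bound, bound] plus the two infinite end points.

    The even spacing is an assumption inferred from the listed points.
    """
    n_steps = round(2 * bound / step)
    finite = [round(-bound + i * step, 10) for i in range(n_steps + 1)]
    return [-math.inf] + finite + [math.inf]

quadratic_grid = support_grid(4.0, 0.4)  # grid for the quadratic problem
linear_grid = support_grid(8.0, 0.1)     # grid for the linear programs

print(len(quadratic_grid), len(linear_grid))
```

Under this spacing assumption the two grids have exactly the number of support points reported in the text.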

Note on the bias corrections: the analytical corrections use the estimators of the bias based on
expected quantities in Fernandez-Val (2008). The jackknife bias correction uses the procedure
described in Hahn and Newey (2004).

Tables 3 and 4 report estimates of the model parameters and marginal effects for 2 and 3
period panels, together with 95% confidence regions obtained using the procedures described
in the previous section. For the general model these regions are constructed using the normal
approximation (95% N) and nonparametric bootstrap with 200 repetitions (95% B). For the
logit and probit models, the confidence regions are obtained by the modified projection method
(95% MP), where the confidence interval for $\mathcal{P}$ in the first stage is approximated by 50,000
DGPs drawn from the empirical multinomial distributions that pass the goodness of fit test;
and the perturbed bootstrap method (95% PB) with $R = 100$, $\gamma = .01$, $\alpha_1 = \alpha_2 = .02$, and 200
simulations from each DGP to approximate the distribution of the statistic. We also include
confidence intervals obtained by a canonical projection method (95% CP) that intersects the
nonparametric confidence interval for $\mathcal{P}$ with the space of probabilities compatible with the
semiparametric model $\Xi$:

$$CR_{1-\alpha}(\mathcal{P}) = \{\Pi \in \Xi : W(\Pi, P) \leq c_{1-\alpha}(\chi^2_{K(J-1)})\}.$$

For the fixed effects estimators, the confidence regions are based on the asymptotic normal
approximation. The semiparametric estimates are shown for $\epsilon_n = 0$, i.e., for the solution that
gives the minimum value in the quadratic problem.

Overall, we find that the estimates and confidence regions based on the general conditional 
mean model are too wide to provide informative evidence about the relationship between par- 
ticipation and fertility for the entire population. The semiparametric estimates seem to offer a 
good compromise between producing more accurate results without adding too much structure 
to the model. Thus, these estimates are always inside the confidence regions of the general
model and do not suffer from important efficiency losses relative to the more restrictive fixed
effects estimates. Another salient feature of the results is that the misspecification problem of
the canonical projection method clearly arises in this application. Thus, this procedure gives 
empty confidence regions for the panel with 3 periods. The modified projection and perturbed 
bootstrap methods produce similar (non-empty) confidence regions for the model parameters 
and marginal effects. 

10 Possible Extensions 

Our analysis is so far confined to models with only discrete explanatory variables. It would be
interesting to extend the analysis to models with continuous explanatory variables. It may be
possible to come up with a sieve-type modification. We expect to obtain a consistent estimator
of the bound by applying the semiparametric method combined with an increasing number of
partitions of the support of the explanatory variables, but we do not yet have a proof. Empirical
likelihood based methods should work in a straightforward manner if the panel model of interest
is characterized by a set of moment restrictions instead of a likelihood. We may be able to
improve the finite-sample properties of our confidence regions by using Bartlett type corrections.

11 Appendix



11.1 Proofs 

Proof of Theorem 1: By eq. 

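$$\sum_t (X_t^k - r^k) E[Y_{it} \mid X_i = X^k] = T r^k (1 - r^k) \int m(1, \alpha) Q_k^*(d\alpha) + T (1 - r^k)(-r^k) \int m(0, \alpha) Q_k^*(d\alpha) = T \sigma_k^2 \mu_k. \tag{35}$$

Note also that $\bar X_i = r^k$ when $X_i = X^k$. Then by the law of large numbers,

$$\sum_{i,t} (X_{it} - \bar X_i)^2 / n \overset{p}{\to} E\Big[\sum_t (X_{it} - \bar X_i)^2\Big] = \sum_k \mathcal{P}_k \sum_t (X_t^k - r^k)^2 = \sum_k \mathcal{P}_k T \sigma_k^2,$$

$$\sum_{i,t} (X_{it} - \bar X_i) Y_{it} / n \overset{p}{\to} E\Big[\sum_t (X_{it} - \bar X_i) Y_{it}\Big] = \sum_k \mathcal{P}_k \sum_t (X_t^k - r^k) E[Y_{it} \mid X_i = X^k] = \sum_k \mathcal{P}_k T \sigma_k^2 \mu_k.$$

Dividing and applying the continuous mapping theorem gives the result. Q.E.D.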
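The limit derived in this proof can be checked by exact arithmetic on a small artificial population. The sketch below is purely illustrative: the number of periods, the uniform probabilities over regressor paths, and the conditional effects `mu` are all hypothetical choices, not quantities from the paper. It computes the probability limit of the within estimator directly from the two population moments and compares it with the $\sigma_k^2$-weighted average of the $\mu_k$.

```python
from fractions import Fraction
from itertools import product

T = 4
sequences = list(product([0, 1], repeat=T))  # every possible regressor path X^k

# Hypothetical population: all paths equally likely, baseline 1/5, and
# conditional effect mu_k = s^2 / 20 where s is the number of ones in X^k.
prob = {xk: Fraction(1, len(sequences)) for xk in sequences}

def mu(xk):
    return Fraction(sum(xk) ** 2, 20)

def cond_mean(xk, t):
    # E[Y_it | X_i = X^k] = baseline + mu_k * X_t^k for a binary regressor
    return Fraction(1, 5) + mu(xk) * xk[t]

num = den = weighted_num = weighted_den = Fraction(0)
for xk in sequences:
    r = Fraction(sum(xk), T)   # r^k, the share of periods with X = 1
    sigma2 = r * (1 - r)       # sigma_k^2
    num += prob[xk] * sum((xk[t] - r) * cond_mean(xk, t) for t in range(T))
    den += prob[xk] * sum((xk[t] - r) ** 2 for t in range(T))
    weighted_num += prob[xk] * sigma2 * mu(xk)
    weighted_den += prob[xk] * sigma2

fe_limit = num / den                           # limit of the within estimator
variance_weighted = weighted_num / weighted_den

# Simple probability-weighted average of mu_k over paths where X changes.
movers = [xk for xk in sequences if 0 < sum(xk) < T]
movers_average = (sum(prob[xk] * mu(xk) for xk in movers)
                  / sum(prob[xk] for xk in movers))

print(fe_limit, variance_weighted, movers_average)
```

In this toy population the fixed effects limit coincides with the variance-weighted average and differs from the simple average over movers, which is the weighting issue the theorem describes.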

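Proof of Theorem 2: The set of $X_i$ where $\tilde r_i > 0$ and $\bar r_i > 0$ coincides with the set for
which $X_i = X^k$ for $k \in \mathcal{K}^*$. On this set it will be the case that $\tilde r_i$ and $\bar r_i$ are bounded away
from zero. Note also that for $t$ such that $X_t^k = x$ we have $E[Y_{it} \mid X_i = X^k] = \int m(x, \alpha) Q_k^*(d\alpha)$.
Therefore, for $\tilde r^k = \#\{t : X_t^k = \tilde x\}/T$ and $\bar r^k = \#\{t : X_t^k = \bar x\}/T$, by the law of large numbers,

$$\frac{1}{n} \sum_{i} 1(\tilde r_i > 0) 1(\bar r_i > 0) \Big\{ \frac{\sum_t 1(X_{it} = \tilde x) Y_{it}}{T \tilde r_i} - \frac{\sum_t 1(X_{it} = \bar x) Y_{it}}{T \bar r_i} \Big\} / D$$

$$\overset{p}{\to} E\Big[ 1(\tilde r_i > 0) 1(\bar r_i > 0) \Big\{ \frac{\sum_t 1(X_{it} = \tilde x) Y_{it}}{T \tilde r_i} - \frac{\sum_t 1(X_{it} = \bar x) Y_{it}}{T \bar r_i} \Big\} \Big] / D$$

$$= \sum_{k \in \mathcal{K}^*} \mathcal{P}_k \Big\{ \frac{T \tilde r^k \int m(\tilde x, \alpha) Q_k^*(d\alpha)}{T \tilde r^k} - \frac{T \bar r^k \int m(\bar x, \alpha) Q_k^*(d\alpha)}{T \bar r^k} \Big\} / D = \sum_{k \in \mathcal{K}^*} \mathcal{P}_k \mu_k,$$

$$\frac{1}{n} \sum_{i=1}^n 1(\tilde r_i > 0) 1(\bar r_i > 0) \overset{p}{\to} E[1(\tilde r_i > 0) 1(\bar r_i > 0)] = \sum_{k \in \mathcal{K}^*} \mathcal{P}_k.$$

Dividing and applying the continuous mapping theorem gives the result. Q.E.D.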

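Proof of Lemma 3: As before let $Q_k^*(\alpha)$ denote the conditional CDF of $\alpha$ given $X_i = X^k$.
Note that

$$\frac{E[Y_{it} \mid X_i = X^k]}{D} = \frac{\int m(X_t^k, \alpha) Q_k^*(d\alpha)}{D}.$$

Also we have

$$\mu_k = \int \Delta(\alpha) Q_k^*(d\alpha) = \frac{\int m(\tilde x, \alpha) Q_k^*(d\alpha)}{D} - \frac{\int m(\bar x, \alpha) Q_k^*(d\alpha)}{D}.$$

Then if there is $\tilde t$ and $\bar t$ such that $X_{\tilde t}^k = \tilde x$ and $X_{\bar t}^k = \bar x$,

$$\frac{E[Y_{i\tilde t} \mid X_i = X^k]}{D} - \frac{E[Y_{i\bar t} \mid X_i = X^k]}{D} = \frac{\int m(\tilde x, \alpha) Q_k^*(d\alpha)}{D} - \frac{\int m(\bar x, \alpha) Q_k^*(d\alpha)}{D} = \mu_k.$$

Also, if $B_\ell \leq m(x, \alpha)/D \leq B_u$, then for each $k$,

$$B_\ell \leq \frac{\int m(\tilde x, \alpha) Q_k^*(d\alpha)}{D} \leq B_u, \qquad -B_u \leq -\frac{\int m(\bar x, \alpha) Q_k^*(d\alpha)}{D} \leq -B_\ell.$$

Then if there is $\bar t$ such that $X_{\bar t}^k = \bar x$ we have

$$B_\ell - \frac{E[Y_{i\bar t} \mid X_i = X^k]}{D} \leq \frac{\int m(\tilde x, \alpha) Q_k^*(d\alpha)}{D} - \frac{\int m(\bar x, \alpha) Q_k^*(d\alpha)}{D} = \mu_k \leq B_u - \frac{E[Y_{i\bar t} \mid X_i = X^k]}{D}.$$

The second inequality in the statement of the lemma follows similarly.

Next, if $\Delta(\alpha)$ has the same sign for all $\alpha$ and if for some $k^*$ there is $\tilde t$ and $\bar t$ such that
$X_{\tilde t}^{k^*} = \tilde x$ and $X_{\bar t}^{k^*} = \bar x$, then $sgn(\Delta(\alpha)) = sgn(\mu_{k^*})$. Furthermore, since $sgn(\mu_k) = sgn(\mu_{k^*})$ is
then known for all $k$, if it is positive the lower bounds, which are nonpositive, can be replaced by
zero, while if it is negative the upper bounds, which are nonnegative, can be replaced by zero.
Q.E.D.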

Proof of Theorem 4: See text. 

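Proof of Theorem 5: Let $Z_{iT} = \min\{\sum_{t=1}^T 1(X_{it} = \tilde x)/T, \sum_{t=1}^T 1(X_{it} = \bar x)/T\}$. Note
that if $Z_{iT} > 0$ then $1(A_{iT}) = 1$ for the event $A_{iT}$ that there exists $\tilde t$ and $\bar t$ such that $X_{i\tilde t} = \tilde x$ and
$X_{i\bar t} = \bar x$. By the ergodic theorem and continuity of the minimum, conditional on $\alpha_i$ we have
$Z_{iT} \to b(\alpha_i) = \min\{\Pr(X_{it} = \tilde x \mid \alpha_i), \Pr(X_{it} = \bar x \mid \alpha_i)\} > 0$. Therefore $\Pr(A_{iT} \mid \alpha_i) \geq
\Pr(Z_{iT} > 0 \mid \alpha_i) \to 1$ for almost all $\alpha_i$. It then follows by the dominated convergence theorem
that

$$\Pr(A_{iT}) = E[\Pr(A_{iT} \mid \alpha_i)] \to 1.$$

Also note that $\Pr(A_{iT}) = 1 - \mathcal{P}^0 - \sum_{k \in \tilde{\mathcal{K}}} \mathcal{P}_k - \sum_{k \in \bar{\mathcal{K}}} \mathcal{P}_k$, so that

$$\max\{|\mu_\ell - \mu_0|, |\mu_u - \mu_0|\} \leq (B_u - B_\ell)\Big(\mathcal{P}^0 + \sum_{k \in \tilde{\mathcal{K}}} \mathcal{P}_k + \sum_{k \in \bar{\mathcal{K}}} \mathcal{P}_k\Big) \to 0.$$

Q.E.D.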

Proof of Theorem 6: Let $\mathcal{P}_1$ and $\mathcal{P}_K$ be as in equation (6). By the Markov assumption,

$$\mathcal{P}_1 = \Pr(X_{i1} = \cdots = X_{iT} = 0) = E[\Pr(X_{i1} = \cdots = X_{iT} = 0 \mid \alpha_i)]$$

$$= E\Big[\prod_{t=J+1}^T \Pr(X_{it} = 0 \mid X_{i,t-1} = \cdots = X_{i,t-J} = 0, \alpha_i) \Pr(X_{i1} = \cdots = X_{iJ} = 0 \mid \alpha_i)\Big] \leq E[(p_i^1)^{T-J}],$$

$$\mathcal{P}_K \leq E[(p_i^K)^{T-J}].$$

The first bound then follows as in (6). The second bound then follows from the condition
$p_i^k \leq 1 - \epsilon$ for $k \in \{1, K\}$. Now suppose that there is a set $A$ of possible $\alpha_i$ such that $\Pr(A) > 0$,
$Q_i = \Pr(X_{i1} = \cdots = X_{iJ} = 0 \mid \alpha_i) > 0$ and $p_i^1 = 1$. Then

$$\mathcal{P}_1 = E[(p_i^1)^{T-J} Q_i] \geq E[1(\alpha_i \in A)(p_i^1)^{T-J} Q_i] = E[1(\alpha_i \in A) Q_i] > 0.$$

Therefore, for all $T$ the probability $\mathcal{P}_1$ is bounded away from zero, and hence $\mu_\ell \not\to \mu_0$ or
$\mu_u \not\to \mu_0$. Q.E.D.

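Proof of Theorem 7: Note that every $X_i \in \mathcal{X}^t(x)$ has $X_{it} = x$. Also, the $X_{is}$ for $s > t$
are completely unrestricted by $X_i \in \mathcal{X}^t(x)$. Therefore, it follows by the key implication that

$$E[Y_{it} \mid X_i \in \mathcal{X}^t(x)] = \int m(x, \alpha) Q^*(d\alpha \mid X_i \in \mathcal{X}^t(x)).$$

Then by iterated expectations,

$$\int m(x, \alpha) Q^*(d\alpha) = \bar{\mathcal{P}}(x) \int m(x, \alpha) Q^*(d\alpha \mid X_i \in \bar{\mathcal{X}}(x)) + \sum_{t=1}^T \Pr(X_i \in \mathcal{X}^t(x)) \int m(x, \alpha) Q^*(d\alpha \mid X_i \in \mathcal{X}^t(x))$$

$$= \bar{\mathcal{P}}(x) \int m(x, \alpha) Q^*(d\alpha \mid X_i \in \bar{\mathcal{X}}(x)) + E\Big[\sum_{t=1}^T 1(X_i \in \mathcal{X}^t(x)) Y_{it}\Big].$$

Using the bound and dividing by $D$ then gives

$$E\Big[\sum_{t=1}^T 1(X_i \in \mathcal{X}^t(x)) Y_{it}\Big]/D + \bar{\mathcal{P}}(x) B_\ell \leq \int m(x, \alpha) Q^*(d\alpha)/D \leq E\Big[\sum_{t=1}^T 1(X_i \in \mathcal{X}^t(x)) Y_{it}\Big]/D + \bar{\mathcal{P}}(x) B_u.$$

Differencing this bound for $x = \tilde x$ and $x = \bar x$ gives the result. Q.E.D.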

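Proof of Theorem 8: The size of the identified set for the marginal effect is

$$\bar\mu_k - \underline\mu_k = \max_{\beta \in B, Q_k \in \mathcal{Q}_{k\beta}} D^{-1} \int [F(\beta + \alpha) - F(\alpha)] Q_k(d\alpha) - \min_{\beta \in B, Q_k \in \mathcal{Q}_{k\beta}} D^{-1} \int [F(\beta + \alpha) - F(\alpha)] Q_k(d\alpha),$$

where $\mathcal{Q}_{k\beta} = \{Q_k : \int \mathcal{L}(Y^j \mid X^k, \alpha, \beta) Q_k(d\alpha) = \mathcal{P}_{jk}, j = 1, \ldots, J\}$. The feasible set of
distributions $\mathcal{Q}_{k\beta}$ can be further characterized in this case. Let
$F_T(\beta, \alpha) := (1, F(X_1^k \beta + \alpha), \ldots, F(X_T^k \beta + \alpha))$ and let $\mathcal{F}_J(\beta, \alpha)$ denote the $J \times 1$ power vector of
$F_T(\beta, \alpha)$ including all the different products of the elements of $F_T(\beta, \alpha)$, i.e.,

$$\mathcal{F}_J(\beta, \alpha) = \Big(1, F(X_1^k \beta + \alpha), \ldots, F(X_1^k \beta + \alpha) F(X_2^k \beta + \alpha), \ldots, \prod_{t=1}^T F(X_t^k \beta + \alpha)\Big).$$

Note that $\mathcal{L}(Y^j \mid X^k, \alpha, \beta) = \prod_{t=1}^T F(X_t^k \beta + \alpha)^{Y_t^j} (1 - F(X_t^k \beta + \alpha))^{1 - Y_t^j}$, so the model
probabilities are linear combinations of the elements of $\mathcal{F}_J(\beta, \alpha)$. Therefore, for $\Pi_k = (\mathcal{P}_{1k}, \ldots, \mathcal{P}_{Jk})$ we
have $\mathcal{Q}_{k\beta} = \{Q_k : A_J \int \mathcal{F}_J(\beta, \alpha) Q_k(d\alpha) = \Pi_k\}$, where $A_J$ is a $J \times J$ matrix of known constants.
The matrix $A_J$ is nonsingular, so we have:

$$\mathcal{Q}_{k\beta} = \Big\{Q_k : \int \mathcal{F}_J(\beta, \alpha) Q_k(d\alpha) = M_k\Big\},$$

where the $J \times 1$ vector $M_k = A_J^{-1} \Pi_k$ is identified from the data.

Now we turn to the analysis of the size of the identified sets. We focus on the case where
$k = 1$, i.e., $X^k$ is a vector of zeros; a similar argument applies to $k = K$. For $k = 1$ we have
that $F(X_t^k \beta + \alpha) = F(\alpha)$ for all $t$, so the power vector only has $T + 1$ different elements given
by $(1, F(\alpha), \ldots, F(\alpha)^T)$. The feasible set simplifies to:

$$\mathcal{Q}_{k\beta} = \Big\{Q_k : \int F(\alpha)^t Q_k(d\alpha) = M_{kt}, t = 0, \ldots, T\Big\},$$

where the moments $M_{kt}$ are identified by the data. Here $\int F(\alpha) Q_k(d\alpha) = M_{k1}$ is fixed in $\mathcal{Q}_{k\beta}$,
so the size of the identified set is given by:

$$\bar\mu_k - \underline\mu_k = \max_{\beta \in B, Q_k \in \mathcal{Q}_{k\beta}} D^{-1} \int F(\beta + \alpha) Q_k(d\alpha) - \min_{\beta \in B, Q_k \in \mathcal{Q}_{k\beta}} D^{-1} \int F(\beta + \alpha) Q_k(d\alpha).$$

By a change of variable, $Z = F(\alpha)$, we can express the previous problem in a form that is
related to a Hausdorff truncated moment problem:

$$\bar\mu_k - \underline\mu_k = \max_{\beta \in B, G_k \in \mathcal{G}_{k\beta}} D^{-1} \int_0^1 h_\beta(z) G_k(dz) - \min_{\beta \in B, G_k \in \mathcal{G}_{k\beta}} D^{-1} \int_0^1 h_\beta(z) G_k(dz), \tag{36}$$

where $\mathcal{G}_{k\beta} = \{G_k : \int_0^1 z^t G_k(dz) = M_{kt}, t = 0, \ldots, T\}$, $h_\beta(z) = F(\beta + F^{-1}(z))$, and $F^{-1}$ is the
inverse of $F$.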

If the objective function is $r$ times continuously differentiable, $h_\beta \in C^r[0, 1]$, with uniformly
bounded $r$-th derivative, $\|h_\beta^{(r)}(z)\|_\infty \leq \bar h_\beta^r$, then we can decompose $h_\beta$ using standard
approximation theory techniques as

$$h_\beta(z) = P_\beta(z, T) + R_\beta(z, T), \tag{37}$$

where $P_\beta(z, T)$ is the $T$-degree best polynomial approximation to $h_\beta$ and $R_\beta(z, T)$ is the
remainder term of the approximation; see, e.g., Judd (1998), Chap. 3. By Jackson's Theorem the
remainder term is uniformly bounded by

$$\|R_\beta(z, T)\|_\infty \leq \frac{(T - r)!}{T!} \Big(\frac{\pi}{4}\Big)^r \bar h_\beta^r = O(T^{-r}), \tag{38}$$

as $T \to \infty$, and this is the best possible uniform rate of approximation by a $T$-degree polynomial.

Next, note that for any $G_k \in \mathcal{G}_{k\beta}$ we have that $\int_0^1 P_\beta(z, T) G_k(dz)$ is fixed, since the first $T$
moments of $Z$ are fixed in $\mathcal{G}_{k\beta}$. Moreover, $\int_0^1 P_\beta(z, T) G_k(dz)$ is fixed on $B$ if the parameter is
point identified, $B = \{\beta^*\}$. Then, we have

$$\bar\mu_k - \underline\mu_k = \max_{G_k \in \mathcal{G}_{k\beta^*}} D^{-1} \int_0^1 R_{\beta^*}(z, T) G_k(dz) - \min_{G_k \in \mathcal{G}_{k\beta^*}} D^{-1} \int_0^1 R_{\beta^*}(z, T) G_k(dz) \leq 2 D^{-1} \|R_{\beta^*}(z, T)\|_\infty = O(T^{-r}). \tag{39}$$

To complete the proof, we need to check the continuous differentiability condition and the
point identification of the parameter for the logit model. Point identification follows from
Chamberlain (1992). For differentiability, note that for the logit model

$$h_\beta(z) = \frac{z e^\beta}{1 + (e^\beta - 1) z}, \tag{40}$$

with derivatives

$$h_\beta^{(r)}(z) = r! \, \frac{e^\beta (1 - e^\beta)^{r-1}}{[1 + (e^\beta - 1) z]^{r+1}}. \tag{41}$$

These derivatives are uniformly bounded by $\bar h_\beta^r = r! \, e^{|\beta|} (e^{|\beta|} - 1)^{r-1} < \infty$ for any finite $r$.
Q.E.D.
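The differentiability step for the logit model can be sanity-checked numerically. The sketch below assumes the logistic link $F(a) = 1/(1 + e^{-a})$, for which $h_\beta(z) = F(\beta + F^{-1}(z))$ has the closed form $z e^\beta / (1 + (e^\beta - 1) z)$, $r$-th derivative $r! \, e^\beta (1 - e^\beta)^{r-1} / (1 + (e^\beta - 1) z)^{r+1}$, and uniform bound $r! \, e^{|\beta|} (e^{|\beta|} - 1)^{r-1}$ on $[0, 1]$. All function names are hypothetical helpers, and the values of $\beta$ used to exercise them are arbitrary.

```python
import math

def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))

def h_composite(z, beta):
    # h_beta(z) = F(beta + F^{-1}(z)) with F the logistic CDF, for 0 < z < 1
    return logistic(beta + math.log(z / (1.0 - z)))

def h_closed(z, beta):
    # Closed form of the same function, valid on the whole interval [0, 1]
    c = math.exp(beta) - 1.0
    return z * math.exp(beta) / (1.0 + c * z)

def h_derivative(z, beta, r):
    # r-th derivative of h_closed with respect to z
    c = math.exp(beta) - 1.0
    return (math.factorial(r) * math.exp(beta) * (-c) ** (r - 1)
            / (1.0 + c * z) ** (r + 1))

def derivative_bound(beta, r):
    # Uniform bound on |h^(r)| over [0, 1]
    b = abs(beta)
    return math.factorial(r) * math.exp(b) * (math.exp(b) - 1.0) ** (r - 1)

def max_abs_derivative(beta, r, n_grid=1001):
    # Largest |h^(r)(z)| over an evenly spaced grid on [0, 1]
    return max(abs(h_derivative(i / (n_grid - 1), beta, r))
               for i in range(n_grid))
```

For positive $\beta$ the bound is attained at $z = 0$ and for negative $\beta$ at $z = 1$, so comparisons against `derivative_bound` need a small floating point tolerance.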

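Proof of Theorem 9: Note that for $T = 2$ and $X$ binary, we have that $K = 4$. Let
$X^1 = (0, 0)$, $X^2 = (0, 1)$, $X^3 = (1, 0)$, and $X^4 = (1, 1)$. By Lemma 3, $\mu_I$ is identified by

$$\mu_I = \mathcal{P}_2^* [\Pr\{Y = (0, 1) \mid X^2\} - \Pr\{Y = (1, 0) \mid X^2\}] + \mathcal{P}_3^* [\Pr\{Y = (1, 0) \mid X^3\} - \Pr\{Y = (0, 1) \mid X^3\}].$$

The probability limit of the fixed effects estimator for this effect is

$$\tilde\mu_I = \sum_{k=2}^3 \mathcal{P}_k^* [\Pr\{Y = (0, 1) \mid X^k\} + \Pr\{Y = (1, 0) \mid X^k\}] [2 F(\tilde\beta/2) - 1].$$

The condition for consistency $\tilde\mu_I = \mu_I$ can be written as

$$F(\tilde\beta/2) = \frac{\mathcal{P}_2 \Pr\{Y = (0, 1) \mid X^2\} + \mathcal{P}_3 \Pr\{Y = (1, 0) \mid X^3\}}{\sum_{k=2}^3 \mathcal{P}_k [\Pr\{Y = (0, 1) \mid X^k\} + \Pr\{Y = (1, 0) \mid X^k\}]},$$

but this is precisely the first order condition of the program (16). This result follows, after some
algebra and using the symmetry property of $F$, by solving the profile problem

$$\tilde\beta = \arg\max_\beta \sum_{k=1}^K \mathcal{P}_k [\Pr\{Y = (0, 1) \mid X^k\} \log F(\Delta X^k \beta/2) + \Pr\{Y = (1, 0) \mid X^k\} \log F(-\Delta X^k \beta/2)],$$

where $\Delta X^k = X_2^k - X_1^k$. Q.E.D.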

Proof of Lemma 10: First, by $\beta \in B$, we have that $T(\beta; \mathcal{P}) = 0$ and therefore any
$Q_{k\beta} \in \arg\min_{Q_k} \sum_{j=1}^J \omega_{jk}(\mathcal{P}) (\mathcal{P}_{jk} - \mathcal{L}_{jk}(\beta, Q_k))^2$ satisfies $\mathcal{L}_{jk}(\beta, Q_{k\beta}) = \mathcal{P}_{jk}$ $\forall j$, for each $k$.
Let the vector of conditional choice probabilities for $(Y^1, \ldots, Y^J)$ be

$$\mathcal{L}_k(\alpha, \beta) = [\mathcal{L}(Y^1 \mid X^k, \alpha, \beta), \ldots, \mathcal{L}(Y^J \mid X^k, \alpha, \beta)]'.$$

Let $\Gamma_k(\beta) = \{\mathcal{L}_k(\alpha, \beta) : \alpha \in C\}$. Note that, for each $\beta \in B$, $\Gamma_k(\beta)$ is a closed and bounded set
due to compactness of $C$, and has at most dimension $J - 1$ since the sum of the elements of
$\mathcal{L}_k(\alpha, \beta)$ is one $\forall \alpha$. Now, let $\mathcal{M}_k(\beta)$ denote the convex hull of $\Gamma_k(\beta)$. For any $\beta \in B$ we have
that there is at least one $Q_{k\beta}$ such that $\mathcal{L}_{jk}(\beta, Q_{k\beta}) = \mathcal{P}_{jk}$ $\forall j$, i.e.,

$$(\mathcal{P}_{1k}, \ldots, \mathcal{P}_{Jk}) \in \mathcal{M}_k(\beta).$$

By Caratheodory's Theorem any point in $\mathcal{M}_k(\beta)$ can be written as a convex combination of at
most $J$ vectors located in $\Gamma_k(\beta)$. Then, we can write

$$(\mathcal{P}_{1k}, \ldots, \mathcal{P}_{Jk}) = \sum_{m=1}^J \pi_{km} \mathcal{L}_k(\alpha_{km}, \beta),$$

where $(\pi_{k1}, \ldots, \pi_{kJ})$ is on the unit simplex $S$ of dimension $J$. Thus, the discrete distribution with
$J$ support points at $(\alpha_{k1}, \ldots, \alpha_{kJ})$ and probabilities $(\pi_{k1}, \ldots, \pi_{kJ})$ solves the population problem
for $Q_{k\beta}$. The result also follows from Lindsay (1995, Theorem 18, p. 112, and Theorem 21, p.
116) (though Lindsay does not provide proofs for his theorems). Q.E.D.

Proof of Lemma 11: For $\beta \in B$, let $\mathcal{Q}_{k\beta} = \{Q_k : \mathcal{L}_{jk}(\beta, Q_k) = \mathcal{P}_{jk}, j = 1, \ldots, J\}$. Let
$\bar Q_{k\beta} \in \mathcal{Q}_{k\beta}$ denote some maximizing value such that

$$\bar\mu_{k\beta} = D^{-1} \int_C [F(\tilde x'\beta + \alpha) - F(\bar x'\beta + \alpha)] \bar Q_{k\beta}(d\alpha).$$

Note that, for any $\epsilon > 0$ we can find a distribution $Q_{k\beta}^M \in \mathcal{Q}_{k\beta}$ with a large number $M \gg J$ of
support points $(\alpha_1, \ldots, \alpha_M)$ such that

$$\bar\mu_{k\beta} - \epsilon < D^{-1} \int_C [F(\tilde x'\beta + \alpha) - F(\bar x'\beta + \alpha)] Q_{k\beta}^M(d\alpha) \leq \bar\mu_{k\beta}.$$

Our goal is to show that given such $Q_{k\beta}^M$ it suffices to allocate its mass over at most $J$
support points. Indeed, consider the problem of allocating $(\pi_{k1}, \ldots, \pi_{kM})$ among $(\alpha_1, \ldots, \alpha_M)$ in
order to solve

$$\max_{(\pi_{k1}, \ldots, \pi_{kM})} \sum_{m=1}^M [F(\tilde x'\beta + \alpha_m) - F(\bar x'\beta + \alpha_m)] \pi_{km}$$

subject to the constraints:

$$\pi_{km} \geq 0, \quad m = 1, \ldots, M,$$

$$\sum_{m=1}^M \pi_{km} \mathcal{L}(Y^j \mid X^k, \alpha_m, \beta) = \mathcal{P}_{jk}, \quad j = 1, \ldots, J,$$

$$\sum_{m=1}^M \pi_{km} = 1.$$

This is a linear program of the form

$$\max_{\pi \in \mathbb{R}^M} c'\pi \quad \text{such that} \quad \pi \geq 0, \; A\pi = b, \; 1'\pi = 1,$$

and any basic feasible solution to this program has $M$ active constraints, of which at most
$rank(A) + 1$ can be equality constraints. This means that at least $M - rank(A) - 1$ of the active
constraints are of the form $\pi_{km} = 0$; see, e.g., Theorem 2.3 and Definition 2.9 (ii) in Bertsimas and
Tsitsiklis (1997). Hence a basic solution to this linear programming problem will have at least
$M - J$ zeroes, that is, at most $J$ strictly positive $\pi_{km}$'s. Thus, we have shown that given the
original $Q_{k\beta}^M$ with $M \gg J$ points of support there exists a distribution $Q_{k\beta}^L \in \mathcal{Q}_{k\beta}$ with just $J$
points of support such that

$$\bar\mu_{k\beta} - \epsilon < D^{-1} \int_C [F(\tilde x'\beta + \alpha) - F(\bar x'\beta + \alpha)] Q_{k\beta}^M(d\alpha) \leq D^{-1} \int_C [F(\tilde x'\beta + \alpha) - F(\bar x'\beta + \alpha)] Q_{k\beta}^L(d\alpha) \leq \bar\mu_{k\beta}.$$

This construction works for every $\epsilon > 0$.

The final claim is that there exists a distribution $Q_{k\beta}^L \in \mathcal{Q}_{k\beta}$ with $J$ points of support
$(\alpha_{k1}, \ldots, \alpha_{kJ})$ such that

$$\bar\mu_{k\beta} = D^{-1} \int_C [F(\tilde x'\beta + \alpha) - F(\bar x'\beta + \alpha)] Q_{k\beta}^L(d\alpha).$$

Suppose otherwise, then it must be that

$$\bar\mu_{k\beta} > \bar\mu_{k\beta} - \epsilon > D^{-1} \int_C [F(\tilde x'\beta + \alpha) - F(\bar x'\beta + \alpha)] Q_{k\beta}^L(d\alpha),$$

for some $\epsilon > 0$ and for all $Q_{k\beta}^L$ with $J$ points of support. This immediately gives a contradiction
to the previous step where we have shown that, for any $\epsilon > 0$, $\bar\mu_{k\beta}$ and the right hand side can
be brought close to each other by strictly less than $\epsilon$. Q.E.D.
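The support-reduction step in this argument can be illustrated with exact rational arithmetic. The sketch below is not the paper's algorithm: it is a simple null-space walk that moves mass along a direction preserving all equality constraints and not lowering the objective until some weight hits zero, repeating until no such direction exists. The example is hypothetical: $T = 2$, $X^k = (0, 0)$, support points chosen so that $F(\alpha_m) = m/10$, and a logit link with $e^\beta = 2$ so that every quantity is rational. The names `null_vector` and `reduce_support` are illustrative helpers.

```python
from fractions import Fraction

def null_vector(rows):
    """Return a nonzero v with rows @ v = 0 (exact arithmetic), or None."""
    m = [row[:] for row in rows]
    n_cols = len(m[0])
    pivots = []
    r = 0
    for col in range(n_cols):
        if r == len(m):
            break
        p = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if p is None:
            continue
        m[r], m[p] = m[p], m[r]
        lead = m[r][col]
        m[r] = [x / lead for x in m[r]]
        for i in range(len(m)):
            if i != r and m[i][col] != 0:
                f = m[i][col]
                m[i] = [x - f * y for x, y in zip(m[i], m[r])]
        pivots.append(col)
        r += 1
    free = [c for c in range(n_cols) if c not in pivots]
    if not free:
        return None
    v = [Fraction(0)] * n_cols
    v[free[0]] = Fraction(1)
    for i, c in enumerate(pivots):
        v[c] = -m[i][free[0]]
    return v

def reduce_support(A, c, pi):
    """Shrink the support of pi while keeping A @ pi and sum(pi) fixed
    and never lowering the objective c @ pi."""
    pi = list(pi)
    while True:
        support = [m for m, p in enumerate(pi) if p > 0]
        rows = [[row[m] for m in support] for row in A]
        rows.append([Fraction(1)] * len(support))
        v = null_vector(rows)
        if v is None:
            return pi
        if sum(c[m] * vm for m, vm in zip(support, v)) < 0:
            v = [-vm for vm in v]
        # v sums to zero and is nonzero, so it has negative entries.
        step = min(pi[m] / -vm for m, vm in zip(support, v) if vm < 0)
        for m, vm in zip(support, v):
            pi[m] += step * vm

# Hypothetical example: F(alpha_m) = m/10, outcomes (0,0), (0,1), (1,0), (1,1).
f_values = [Fraction(i, 10) for i in range(1, 10)]
A = [[(1 - f) ** 2 for f in f_values],
     [(1 - f) * f for f in f_values],
     [f * (1 - f) for f in f_values],
     [f ** 2 for f in f_values]]
c = [2 * f / (1 + f) - f for f in f_values]  # F(beta + alpha) - F(alpha), e^beta = 2
pi_start = [Fraction(1, 9)] * 9
pi_reduced = reduce_support(A, c, pi_start)
```

Starting from nine equally weighted support points, the reduced distribution reproduces the same four choice probabilities with at most $J = 4$ support points and an objective value that is no smaller.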

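Note to the proof of Lemma 11: $rank(A) \leq J - 1$, since $\sum_{j=1}^J \mathcal{L}(Y^j \mid X^k, \alpha, \beta) = 1$. The exact
rank of $A$ depends on the sequence $X^k$, the parameter $\beta$, the function $F$, and $T$. For $T = 2$ and $X$
binary, for example, $rank(A) = J - 2 = 2$ when $X_1^k = X_2^k$, $\beta = 0$, or $F$ is the logistic distribution;
whereas $rank(A) = J - 1 = 3$ for $X_1^k \neq X_2^k$, $\beta \neq 0$, and $F$ is any continuous distribution different
from the logistic.

Some lemmas are useful for proving Theorem 12.

Lemma A1: Let $T(\beta, Q; \Pi) = \sum_{j,k} \omega_{jk}(\Pi) (\Pi_{jk} - \mathcal{L}_{jk}(\beta, Q_k))^2$. If Assumption 1 is satisfied
then, for $\mathcal{Q}$ equal to the collection of distributions with support contained in a compact set $C$,

$$\sup_{\beta \in \mathbb{B}, Q \in \mathcal{Q}} |T(\beta, Q; P) - T(\beta, Q; \mathcal{P})| = o_P(1).$$

Proof: Note that we can write

$$T(\beta, Q; P) - T(\beta, Q; \mathcal{P}) = \sum_{j,k} \omega_{jk}(P) (P_{jk} - \mathcal{P}_{jk})^2 + 2 \sum_{j,k} \omega_{jk}(P) (P_{jk} - \mathcal{P}_{jk}) (\mathcal{P}_{jk} - \mathcal{L}_{jk}(\beta, Q_k))$$

$$+ \sum_{j,k} (\omega_{jk}(P) - \omega_{jk}(\mathcal{P})) (\mathcal{P}_{jk} - \mathcal{L}_{jk}(\beta, Q_k))^2.$$

The result then follows from $P_{jk} - \mathcal{P}_{jk} = o_P(1)$ and $\omega_{jk}(P) - \omega_{jk}(\mathcal{P}) = o_P(1)$ by the continuous
mapping theorem. Q.E.D.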

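From Lemma A1, we obtain one-sided uniform convergence:

Lemma A2: Let $T(\beta; \Pi) = \inf_{Q \in \mathcal{Q}} T(\beta, Q; \Pi)$. If Assumption 1 is satisfied then

$$\sup_{\beta \in \mathbb{B}} |T(\beta; P) - T(\beta; \mathcal{P})| = o_P(1).$$

Proof: Let $\hat Q_\beta \in \arg\inf_{Q \in \mathcal{Q}} T(\beta, Q; P)$ and $Q_\beta \in \arg\inf_{Q \in \mathcal{Q}} T(\beta, Q; \mathcal{P})$. By definition of $\hat Q_\beta$ and
$Q_\beta$, we have uniformly in $\beta$ and for all $n$,

$$T(\beta, \hat Q_\beta; P) - T(\beta, \hat Q_\beta; \mathcal{P}) \leq T(\beta, \hat Q_\beta; P) - T(\beta, Q_\beta; \mathcal{P}) \leq T(\beta, Q_\beta; P) - T(\beta, Q_\beta; \mathcal{P}).$$

Hence

$$|T(\beta, \hat Q_\beta; P) - T(\beta, Q_\beta; \mathcal{P})| \leq \max\big[ |T(\beta, \hat Q_\beta; P) - T(\beta, \hat Q_\beta; \mathcal{P})|, |T(\beta, Q_\beta; P) - T(\beta, Q_\beta; \mathcal{P})| \big] = o_P(1),$$

uniformly in $\beta$ by Lemma A1. Q.E.D.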

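Lemma A3: If Assumption 1 is satisfied then $T(\beta; \mathcal{P})$ is continuous in $\beta$.

Proof: By Lemma 10, the problem

$$\inf_{Q \in \mathcal{Q}} T(\beta, Q; \mathcal{P})$$

can be rewritten as

$$\min_{(\alpha_{1k}, \ldots, \alpha_{Jk}) \in C, \, (\pi_{1k}, \ldots, \pi_{Jk}) \in S, \, \forall k} \; \sum_{j,k} \omega_{jk}(\mathcal{P}) \Big( \mathcal{P}_{jk} - \sum_{m=1}^J \pi_{mk} \mathcal{L}(Y^j \mid X^k, \alpha_{mk}, \beta) \Big)^2,$$

where $J$ and $K$ are finite, and $S$ denotes the unit simplex in $\mathbb{R}^J$. Here, $(\alpha_{1k}, \ldots, \alpha_{Jk})$ and
$(\pi_{1k}, \ldots, \pi_{Jk})$ characterize discrete distributions with no more than $J$ points of support. Because
the objective function is continuous in $(\beta, \alpha_{11}, \ldots, \alpha_{JK}, \pi_{11}, \ldots, \pi_{JK})$, and because $C^{JK} \times S^K$ is
compact, we can apply the theorem of the maximum (e.g., Stokey and Lucas 1989, Theorem
3.6), and obtain the desired conclusion. Q.E.D.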

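Lemma A4: If Assumption 1 is satisfied then

$$\sup_{\beta \in B} |T(\beta; P) - T(\beta; \mathcal{P})| = O_P(n^{-1}).$$

Proof: Let $Q_\beta \in \arg\min_{Q \in \mathcal{Q}} T(\beta, Q; \mathcal{P})$. By Lemma 10, we have that $\mathcal{P}_{jk} = \mathcal{L}_{jk}(\beta, Q_{k\beta})$ and
$T(\beta; \mathcal{P}) = 0$ $\forall \beta \in B$. Then, we have

$$\sup_{\beta \in B} |T(\beta; P) - T(\beta; \mathcal{P})| = \sup_{\beta \in B} T(\beta; P) \leq \sup_{\beta \in B} T(\beta, Q_\beta; P) = \sum_{j,k} \omega_{jk}(P) (P_{jk} - \mathcal{P}_{jk})^2 = O_P(n^{-1}),$$

where the last equality follows from $P_{jk} - \mathcal{P}_{jk} = O_P(n^{-1/2})$, $\omega_{jk}(P) = \omega_{jk}(\mathcal{P}) + o_P(1)$ by the
continuous mapping theorem, and $J$ and $K$ being finite. Q.E.D.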

Proof of Theorem 12. The consistency result under Assumption 1 and $\epsilon_n \propto \log n / n$
follows from Theorem 3.1 in Chernozhukov, Hong, and Tamer (2007) with $a_n = n$. Indeed,
Condition C.1 in Chernozhukov, Hong, and Tamer (2007) follows by Assumption 1 ($\mathbb{B}$ compact),
Lemma A3 ($T(\beta; \mathcal{P})$ continuous), Lemma A2 (uniform convergence of $T(\beta; P)$ to $T(\beta; \mathcal{P})$ in $\mathbb{B}$),
and Lemma A4 (uniform convergence of $T(\beta; P)$ to $T(\beta; \mathcal{P})$ in $B$ at a rate $n$).

The consistency result under Assumptions 1 and 2 and $\epsilon_n = 0$ follows from Theorem 3.2 in
Chernozhukov, Hong, and Tamer (2007) with $a_n = n$. It is not difficult to show that Assumption
3.2 implies condition C.3 in Chernozhukov, Hong, and Tamer (2007), which along with other
conditions verified above, implies the consistency result.

The second result follows by redefining the estimation problem as

$$P^* \in \epsilon_n\text{-}\arg\min_{\Pi \in \Xi} W(\Pi, P), \qquad W(\Pi, P) = \sum_{j,k} \omega_{jk}(P) (P_{jk} - \Pi_{jk})^2,$$

where $P^* = (P_{jk}^*, j = 1, \ldots, J, k = 1, \ldots, K)$ and $\Xi$ is the space of conditional choice probabilities
that are compatible with the model. Under Assumption 1, $\Xi$ is compact, the function $\Pi \mapsto
W(\Pi, P)$ is continuous for each $P$ in the neighborhood of $\mathcal{P}$, and therefore $W(\Pi, P) - W(\Pi, \mathcal{P}) =
o_P(1)$ uniformly in $\Pi \in \Xi$, as $P = \mathcal{P} + o_P(1)$. Moreover, $\Pi \mapsto W(\Pi, \mathcal{P})$ is uniquely minimized
at $\Pi = \mathcal{P}^*$ by assumption. Therefore, by the consistency theorem for approximate argmin
estimators, it follows that the $\epsilon_n$-argmin $P^*$ is consistent for $\mathcal{P}^*$. Q.E.D.

Proof of Theorem 13. We consider the upper bounds only, since the proof for lower
bounds is analogous. We have that (i) the projection

$$\Pi^* = \Pi^*(\Pi) \in \arg\min_{\tilde\Pi \in \Xi} \sum_{j,k} \omega_{jk}(\Pi) (\Pi_{jk} - \tilde\Pi_{jk})^2$$

is continuous at $\mathcal{P}$ by the theorem of the maximum, (ii) the parameter space for $\beta$ and $\Pi$ is
compact, (iii) the function defining the constraints

$$(\Pi, \beta, \alpha_{k1}, \ldots, \alpha_{kJ}, \pi_{k1}, \ldots, \pi_{kJ}) \mapsto \Pi_{jk}^* - \sum_{m=1}^J \mathcal{L}(Y^j \mid X^k, \alpha_{km}, \beta) \pi_{km}$$

is continuous by Assumption 1 and the continuity of the projection, and (iv) the criterion
function

$$(\Pi, \beta, \alpha_{k1}, \ldots, \alpha_{kJ}, \pi_{k1}, \ldots, \pi_{kJ}) \mapsto \sum_{m=1}^J [F(\tilde x'\beta + \alpha_{km}) - F(\bar x'\beta + \alpha_{km})] \pi_{km}$$

is continuous by the assumed continuity of $F$. Then, using the theorem of the maximum, we
conclude that the maximal mapping

$$(\beta, \Pi) \mapsto \bar\mu_k(\beta, \Pi)$$

is continuous. By Theorem 12 and the extended continuous mapping theorem we have that
$d_H(B_n, B^*) \overset{p}{\to} 0$ implies that

$$d_H(\bar\mu_k(B_n, P), \bar\mu_k(B^*, \mathcal{P})) \overset{p}{\to} 0,$$

where $\bar\mu_k(A, \Pi) = \{\bar\mu_k(\beta, \Pi) : \beta \in A\}$. The conclusion of the theorem then immediately follows.
Q.E.D.

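Proof of Theorem 14: By the uniform central limit theorem, $W(\mathcal{P}, P)$ converges in law
to $\chi^2_{K(J-1)}$ under any sequence of true DGPs with $\mathcal{P}$ in $\Lambda$. It follows that

$$\lim_{n \to \infty} \Pr\nolimits_{\mathcal{P}}\{\mathcal{P} \in CR_{1-\alpha}(\mathcal{P})\} = 1 - \alpha.$$

Further, the event $\mathcal{P} \in CR_{1-\alpha}(\mathcal{P})$ implies the event $\mathcal{P}^* \in \{\Pi^*(\Pi) : \Pi \in CR_{1-\alpha}(\mathcal{P})\}$ by construction,
which in turn implies the events $B^* \in CR_{1-\alpha}(B^*)$ and $[\underline\mu_k^*, \bar\mu_k^*] \subseteq CR_{1-\alpha}[\underline\mu_k^*, \bar\mu_k^*]$, $\forall k$. Q.E.D.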

Proof of Theorem 15. We have that for $S_n(\mathcal{P}) = \hat\theta - \theta^* = \hat\theta - \theta^*(\mathcal{P})$

< PrvoiiSnir) t [G-i(a2,P),G;'(l - ai,P)]} n G CRi_^(P)}] 

+Pr7.„{P CRi-^(P)} 

< PrpJ{5„(P) [G;i(a2,P),G„Hl " n G CRi_^(P)}] 

+Pr7,jP CRi_^(P)} 

< Vxv,{Sn{V) ^ [G-i(a2,7'),G-i(l - ai,7')]} + Prp„{7' CRi_^(7')} 
<a + Prpo{:P0CRi_^(:P)}. 

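Thus if $\limsup_{n \to \infty} \Pr_{\mathcal{P}_0}\{\mathcal{P}_0 \notin CR_{1-\gamma}(\mathcal{P})\} \leq \gamma$, we obtain that
$\limsup_{n \to \infty} \Pr_{\mathcal{P}_0}\{\theta_0^* \notin [\underline\theta, \bar\theta]\} \leq \alpha + \gamma$, which is the desired conclusion.

It now remains to show that $\limsup_{n \to \infty} \Pr_{\mathcal{P}_0}\{\mathcal{P}_0 \notin CR_{1-\gamma}(\mathcal{P})\} \leq \gamma$. We have that

$$\Pr\nolimits_{\mathcal{P}_0}\{\mathcal{P}_0 \notin CR_{1-\gamma}(\mathcal{P})\} = \Pr\nolimits_{\mathcal{P}_0}\{W(\mathcal{P}_0, P) > c_{1-\gamma}(\chi^2_{K(J-1)})\}.$$

By the uniform central limit theorem, $W(\mathcal{P}_0, P)$ converges in law to $\chi^2_{K(J-1)}$ under any sequence
$\mathcal{P}_0$ in $\Lambda$. Therefore,

$$\lim_{n \to \infty} \Pr\nolimits_{\mathcal{P}_0}\{W(\mathcal{P}_0, P) > c_{1-\gamma}(\chi^2_{K(J-1)})\} = \Pr\{\chi^2_{K(J-1)} > c_{1-\gamma}(\chi^2_{K(J-1)})\} = \gamma.$$

Q.E.D.

11.2 Generic Uniqueness of Projections of Probabilities onto the Model Space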



The following lemma is motivated by the analysis of Newey (1986) on generic uniqueness of
quasi-maximum likelihood population parameter values.

Lemma A5. Let $\Theta$ be a set of vectors $\Pi = (\Pi_{jk}, j = 1, \ldots, J, k = 1, \ldots, K) > 0$ that satisfy the
system of linear constraints $\sum_{j=1}^J \Pi_{jk} = 1$, $k = 1, \ldots, K$. Let $proj(\Pi) = \arg\min_{\Pi' \in \Xi} d(\Pi, \Pi')$,
where $d(\Pi, \Pi') = \sum_{k=1}^K \sum_{j=1}^J \omega_{jk} (\Pi_{jk} - \Pi'_{jk})^2$, be the projection of $\Pi$ on the set $\Xi$, where
$\omega_{jk} > 0$ for all $(j, k)$ are weights normalized so that $d$ is a proper distance, and $\Xi = \{\Xi(\beta), \beta \in \mathbb{B}\}$
where $\mathbb{B}$ is compact and $\Xi(\beta) = \{\Pi \in \Theta : (\Pi_{1k}, \ldots, \Pi_{Jk})' \in \Gamma_k(\beta), \forall k\}$, where $\Gamma_k$ is defined as in
Section 7, with link function $F$ being twice continuously differentiable. The set $\Theta_0 = \{\Pi \in \Theta :
proj(\Pi) \text{ is unique}\}$ is an open dense subset of $\Theta$.

Proof: We first note that $\Xi$ is compact and $\Pi' \mapsto d(\Pi, \Pi')$ is continuous, so that the minimum
is attainable, and the projection exists. The rest of the proof has two steps: verification of
openness of $\Theta_0$ and verification of denseness of $\Theta_0$ relative to $\Theta$.

To verify openness, we take $\Pi_0 \in \Theta_0$ and find an open neighborhood $N$ of $\Pi_0$ in $\Theta$ such that
$N \subset \Theta_0$. We consider two cases. First, if $proj(\Pi_0)$ is in the interior of $\Xi$, then there exists an
open neighborhood $N'$ of $\Pi_0$ in $\Xi$. For each $\Pi$ in $N'$, we necessarily have that $proj(\Pi) = \Pi$,
so we can take $N = N'$. Second, if $\Pi_0$ is on the boundary of $\Xi$, the verification follows by an
argument similar to that given by Newey (1986), p. 7.

To verify denseness, we take $\Pi_0 \in \Theta \setminus \Theta_0$, so that $proj(\Pi_0)$ is not unique. For this to happen
it must be that $\Pi_0 \notin \Xi$. Take any element $\Pi_0^*$ of $proj(\Pi_0)$. Then we can construct a sequence
$\Pi_n$ approaching $\Pi_0$ such that $proj(\Pi_n) = \Pi_0^*$, so that $\Pi_n \in \Theta_0$. Indeed, simply take

$$\Pi_n = \frac{1}{n} \Pi_0^* + \frac{n - 1}{n} \Pi_0.$$

Clearly, $\Pi_n \in \Theta$ and it approaches $\Pi_0$. Also, note that by definition $\Pi_0^*$ is a point of intersection
of $\Xi$ with the contour set or ellipse $C_0 = \{\Pi' \in \Theta : d(\Pi_0, \Pi') = t\}$ for $t = \min_{\Pi \in \Xi} d(\Pi_0, \Pi)$. Also,
note that the contour set or sphere $C_n = \{\Pi' \in \Theta : d(\Pi_n, \Pi') = t'\}$, where $t' = \min_{\Pi \in \Xi} d(\Pi_n, \Pi)$,
is a strict subset of the sphere $C_0$, since by convexity of the distance

$$t' \leq d(\Pi_n, \Pi_0^*) \leq \frac{1}{n} d(\Pi_0^*, \Pi_0^*) + \frac{n - 1}{n} d(\Pi_0, \Pi_0^*) = \frac{n - 1}{n} t < t,$$

with only one common point $C_n \cap C_0 = \Pi_0^* \in \Xi$. This establishes that $proj(\Pi_n) = \Pi_0^*$. Q.E.D.

11.3 Computation 

The quadratic problem (26) can be solved using computational techniques developed for finite
mixture models such as the EM algorithm or vertex direction methods; see, e.g., Laird (1978),
Böhning (1995), Lindsay (1995, Chap. 6) and Aitkin (1999). These iterative algorithms, however,
are sensitive to initial values and can be very slow to converge in this problem, where we
estimate several mixtures over a grid of values for $\beta$. Moreover, a slow algorithm is especially
inconvenient for the resampling based inference that we develop in Section 8. The main
computational difficulty in the mixture problems is to find the location of the support points; see, e.g.,
Aitkin (1999). Since the mixtures are nuisance parameters in our problem, we propose solving
the following penalized quadratic problem:

$$T_\lambda(\beta; P) = \min_{\pi} \sum_{j,k} \omega_{jk} \Big[ P_{jk} - \sum_{m=1}^M \pi_{km} \mathcal{L}(Y^j \mid X^k, \alpha_m, \beta) \Big]^2 + \lambda_n \sum_{k} \sum_{m=1}^M \pi_{km}^2 \tag{42}$$

$$\text{s.t.} \quad \sum_{m=1}^M \pi_{km} = 1, \quad \pi_{km} \geq 0, \quad \forall k, m,$$

where $M$ is large and $\lambda_n$ is small. For the weights, we set
$\omega_{jk} = n P_k / \sum_{m=1}^M \tilde\pi_{km} \mathcal{L}(Y^j \mid X^k, \alpha_m, \tilde\beta)$, where $(\tilde\beta, \{\tilde\pi_{km}, \forall (k, m)\})$ is an initial estimate.

The above program is a convex quadratic programming problem for which there are reliable
algorithms to find the solution in polynomial time; see, e.g., the quadprog package in R
(Weingessel, 2007). The penalty $\lambda_n$ acts by choosing a distribution among the set of discrete
distributions with support contained in a large grid $\{\alpha_1, \ldots, \alpha_M\}$. In general there is an infinite
number of solutions for $Q_k$; one of them is a discrete distribution with no more than $J \ll M$
support points by Lemma 10. Here, instead of searching for the solution with the minimal
support, we search over discrete distributions with support points contained in a large partition of
the parameter space $C$. By making the partition fine enough we guarantee to cover a solution
to the problem, without having to find explicitly the location of the support points. The error
of the finite grid approximation approaches zero as $M \to \infty$ if $C$ is compact and the objective
function has bounded variation with respect to $\alpha_m$; see, e.g., Lindsay (1995, Chap. 6). The
penalty favors distributions with large supports. This regularization therefore addresses the
computational difficulties created by the non-identifiability of $Q_k$.
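A toy calculation shows why a squared-weight penalty selects spread-out mixing distributions among those that fit equally well. The setting is deliberately simplified and hypothetical: a single period with a logistic link and a symmetric three-point grid, so that a point mass at zero and a uniform distribution over the grid imply the same choice probability. The penalty weight `lam` is an arbitrary illustrative value.

```python
import math

def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))

# Hypothetical symmetric grid of support points alpha_m.
grid = [-2.0, 0.0, 2.0]

def fitted_probability(weights):
    # Model-implied Pr(Y = 1) under mixing weights pi_m on the grid
    return sum(w * logistic(a) for w, a in zip(weights, grid))

def penalty(weights, lam):
    # Ridge-type penalty lam * sum_m pi_m^2
    return lam * sum(w * w for w in weights)

point_mass = [0.0, 1.0, 0.0]
spread_out = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]
lam = 0.01

print(fitted_probability(point_mass), fitted_probability(spread_out))
print(penalty(point_mass, lam), penalty(spread_out, lam))
```

Both distributions fit the same probability, so the quadratic fit term cannot distinguish them, while the penalty is three times smaller for the distribution with the larger support.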

The final estimates of the identified sets for the parameters and marginal effects are computed
by solving the linear programming problems (24) and (25) for all the parameter values $\beta$ which
satisfy the condition $T_\lambda(\beta; P) \leq \min_\beta T_\lambda(\beta; P) + \epsilon_n$, and replacing the $\mathcal{P}_{jk}$'s by the probabilities
predicted by the model, the $P_{jk}^*$'s, for this parameter value $\beta$, defined as in (28).
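The selection of parameter values in this last step amounts to keeping every grid point whose criterion value is within the slack of the minimum. A minimal sketch follows, assuming the penalized criterion has already been evaluated on a grid of $\beta$ values; the grid, the criterion values, and the slack used here are made-up numbers, and `near_minimizers` is a hypothetical helper name.

```python
def near_minimizers(beta_grid, criterion_values, eps):
    """Return the grid values of beta whose criterion is within eps of the minimum."""
    best = min(criterion_values)
    return [b for b, v in zip(beta_grid, criterion_values) if v <= best + eps]

# Made-up criterion values on a coarse grid, for illustration only.
beta_grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
criterion_values = [0.90, 0.12, 0.10, 0.11, 0.80]

estimated_set = near_minimizers(beta_grid, criterion_values, eps=0.015)
point_estimate = near_minimizers(beta_grid, criterion_values, eps=0.0)
print(estimated_set, point_estimate)
```

With a zero slack only the minimizer survives; a positive slack returns a set of values, and the linear programs would then be solved at each retained value.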